The project that we are working on is https://www.kaggle.com/c/home-credit-default-risk/
It is primarily due to poor or lack of credit histories of the applicants that the vast majority of loan applicants are denied. Thus, these applicants turn to untrustworthy lenders for their financial support, and they risk being taken advantage of, mostly with unreasonably high rates of interest. In order to address this issue “Home Credit” which is a 26 years old (founded in 1997) lending agency, spread across 8 countries, strives to broaden financial inclusion for the unbanked population by providing easy, fast and safe borrowing. In this project, we hope to predict a borrower's ability to repay a loan using historical loan application data and training them on predictive machine learning models, Naive Bayes, Logistic Regression, Random Forest, Stochastic GD. The models will be evaluated on ROC_AUC, Accuracy, Confusion Matrix, Log loss and, F-1 Score.
In the previous, by performing EDA we have identified the best features for our prediction model and upon performing Logistic Regression a Train accuracy of 92% was achieved. The kaggle submission private and public score are 0.715 and 0.722 respectively.
In phase 3 after performing feature engineering and extraction new features Random forest and stochastic GD models were implemented and a kaggle public and private score for Random forest are 0.712 and 0.71 respectively. A training accuracy of 92% was achiebed for both random forest and stochastic GD.
We have implemented MLP and a training accuracy of 92.4% was achieved with best parameters as MLP_alpha: 0.01 and MLP_hidden_layer_size:500.
To predict whether the applicant will repay the loan using the historical data we will be performing the following algorithms on the data to achieve most accurate results:
Home Credit group has over 29 million customers, total assests of 21 billions Euro, over 160 millions loans, with the majority in Asia and and almost half of them in China.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Phase 2:
Data describing & Data preprocessing: This task involves describing the data using pandas dataframes to understand the data and to eliminate any missing/null values from data and clean the data.
EDA & Baseline Modeling: This task involves analysis of categorical and numerical features and their correlation and visual EDa of the data. With the help of EDA Feature extraction is done and Baseline modeling is performed.
Phase 3:
Feature Engineering: This task involes creating new features from the existing ones (and then discarding the original features), Feature Extraction attempts to reduce the number of features in a dataset.
Hyperparameter tuning: This task involves hyperparameter tuning using the grid search method.
Testing & Reporting: Test the ML models on the test data and evaluate them on metrics.
Home Credit uses various kinds of data and we have been provided with 9 datasets which will help us analyze and classify clients based on risk of non- repayment. The application test and train files represent the data that will be used to train and test, and it gives data about loan applicants during the time of filling the application. The bureau.csv has data about previous loans taken from other financial institutes and those which were reported to the Credit bureau and contains those many rows as the number of loans provided to the client. Bureau_balance file has data about client’s balances taken monthly for previous loans those reported to Credit bureau. POS_CASH_balance has a client's previous cash loans, POS balance snapshots taken monthly where each row consists of each month’s history for previous loans. Credit_card_balance file consists of balance snapshots taken monthly of the client’s previous credit cards and each row shows every month history of previous credit cards in Home Credit. previous_applications show all past applications of applicants for Home Credit loans. installments_payments show past loans history of repayment for loan amount disbursed where each row shows either a payment or missed payment that was made. HomeCredit_columns_description file has data describing the columns in other dataset files. The application train dataset has 122 columns, 307511 rows. We have 121 features where 105 are numerical and 16 are categorical. We aim to examine the qualification of individuals with no credit history in getting a home loan. Sometimes, it happens that an individual might not have a credit history but still requires a home loan. In such cases, loan approval on the basis of credit rating is not a viable option. Our project will look at all the other factors for an individual. From monthly income to previous loan applications, we will cover all the financial aspects of the individual and classify them as either ‘no risk’ individuals or ‘credit risk’ individuals. In this phase we tackle tasks like understanding various features of the raw data, performing EDA and pre-processing data, dataset split to test, train and validation, training baseline random forest and logistic regression models and analysing their performance.
We perform the following tasks:
Data description using pandas dataframe: pandas.Dataframe is a two-dimensional, size-mutable, potentially heterogenous tabular data structure that also contains labeled axes (rows and columns). Methods used to decribe data from the dataframes are:
Elimination of null/missing data: To determine the type of values/datatype|featurespresent in the dataset values as zeroes and to check for any kind of null values/missing values in our data. After checking for missing values, we remove more than 20% of the missing data. Similarly, we will remove the columns having more than 80% rows that contain values as zeros only.
Data Segregation: We segregate the data into Categorical & Numerical Variables.
Data joining/merging: We Join the features that have high correlation by odentifying them from EDA.
Best feature extraction: Extracting the top important features for the model pipelining and hyperparameter tuning.
# Importing all the necessary python libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
#Importing the database files
import os
import shutil
from google.colab import drive
drive.mount('/content/drive')
src_dir = '/content/drive/MyDrive/IU/AML/home-credit-default-risk'
dst_dir = '/content'
for i in os.listdir(src_dir):
shutil.copy2(src= (src_dir + '/' + i), dst= dst_dir)
Mounted at /content/drive
Exploratory data analysis is the crucial process of doing preliminary analyses on data in order to find patterns, identify anomalies, test hypotheses, and double-check assumptions with the aid of summary statistics and graphical representations. EDA helps with a better understanding of the variables in the data collection and their relationships, and is usually used to investigate what data might disclose beyond the formal modeling or hypothesis testing assignment. It can also assist in determining the suitability of the statistical methods you are considering using for data analysis.
df_train = pd.read_csv('application_train.csv')
df_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
# Describe data
df_train.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
#Train Data information
df_train.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
# Test data information
df_test = pd.read_csv('application_test.csv')
df_test.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB
#Describe test data
df_test.describe()
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
percent_missing = df_train.isnull().sum() * 100 / len(df_train)
missing_value_df = pd.DataFrame({'column_name': df_train.columns, 'percent_missing': percent_missing}).sort_values('percent_missing', ascending=False)
display(missing_value_df.to_markdown)
<bound method DataFrame.to_markdown of column_name percent_missing COMMONAREA_MEDI COMMONAREA_MEDI 69.872297 COMMONAREA_AVG COMMONAREA_AVG 69.872297 COMMONAREA_MODE COMMONAREA_MODE 69.872297 NONLIVINGAPARTMENTS_MODE NONLIVINGAPARTMENTS_MODE 69.432963 NONLIVINGAPARTMENTS_AVG NONLIVINGAPARTMENTS_AVG 69.432963 ... ... ... NAME_HOUSING_TYPE NAME_HOUSING_TYPE 0.000000 NAME_FAMILY_STATUS NAME_FAMILY_STATUS 0.000000 NAME_EDUCATION_TYPE NAME_EDUCATION_TYPE 0.000000 NAME_INCOME_TYPE NAME_INCOME_TYPE 0.000000 SK_ID_CURR SK_ID_CURR 0.000000 [122 rows x 2 columns]>
By creating new features from the existing ones (and then discarding the original features), Feature Extraction attempts to reduce the number of features in a dataset. The new reduced set of features will be able to summarize much of the information that was contained in the original set of features. Thus, an abridged version of the original features can be created by combining them.
In our analysis of the data, we found that there are many missing values. Columns with more than 20% of missing values were removed. Our team checked the columns for the distribution of 0's and removed the columns with 85% of rows with only 0's. In addition, we divided the data into numerical and categorical data. The numerical data was handled by creating an intermediate imputer pipeline in which the missing values were replaced with the mean of the data, while the missing values in categorical missing data were handled by encoding the data based upon OHE (One Hot Encoding) and replacing the missing values with the mode of the columns.
def missing_percentage(df):
missing_values = df.isnull().sum(axis=0)*100/len(df)
return missing_values.sort_values(ascending=False)
import seaborn as sns
def missing(df,n):
new_df = missing_percentage(df).reset_index()
categorical_ = []
new_df.columns = ['index','flag']
final_df = []
for row in new_df.itertuples():
try:
final_df.append([row.index,row.flag,df[row.index].median(),df[row.index].mean(), df[row.index].nunique()])
except:
final_df.append([row.index,row.flag,df[row.index].mode(),'NA',df[row.index].nunique()])
columns = ['col_name','percentage_missing','median/Mode','mean','no_of_unique_values']
temp = pd.DataFrame(final_df,columns=columns)
return temp[temp['percentage_missing']>n]
sns.set(style="ticks", context="talk")
plt.style.use("dark_background")
df_20 = missing(df_train,20)
df_analysis = missing(df_train,0)
print("Columns with missing data percentage more than 20%")
display(df_20)
print("\nAnalysis of training data")
display(df_analysis)
Columns with missing data percentage more than 20%
| col_name | percentage_missing | median/Mode | mean | no_of_unique_values | |
|---|---|---|---|---|---|
| 0 | COMMONAREA_MEDI | 69.872297 | 0.0208 | 0.044595 | 3202 |
| 1 | COMMONAREA_AVG | 69.872297 | 0.0211 | 0.044621 | 3181 |
| 2 | COMMONAREA_MODE | 69.872297 | 0.019 | 0.042553 | 3128 |
| 3 | NONLIVINGAPARTMENTS_MODE | 69.432963 | 0.0 | 0.008076 | 167 |
| 4 | NONLIVINGAPARTMENTS_AVG | 69.432963 | 0.0 | 0.008809 | 386 |
| 5 | NONLIVINGAPARTMENTS_MEDI | 69.432963 | 0.0 | 0.008651 | 214 |
| 6 | FONDKAPREMONT_MODE | 68.386172 | 0 reg oper account dtype: object | NA | 4 |
| 7 | LIVINGAPARTMENTS_MODE | 68.354953 | 0.0771 | 0.105645 | 736 |
| 8 | LIVINGAPARTMENTS_AVG | 68.354953 | 0.0756 | 0.100775 | 1868 |
| 9 | LIVINGAPARTMENTS_MEDI | 68.354953 | 0.0761 | 0.101954 | 1097 |
| 10 | FLOORSMIN_AVG | 67.848630 | 0.2083 | 0.231894 | 305 |
| 11 | FLOORSMIN_MODE | 67.848630 | 0.2083 | 0.228058 | 25 |
| 12 | FLOORSMIN_MEDI | 67.848630 | 0.2083 | 0.231625 | 47 |
| 13 | YEARS_BUILD_MEDI | 66.497784 | 0.7585 | 0.755746 | 151 |
| 14 | YEARS_BUILD_MODE | 66.497784 | 0.7648 | 0.759637 | 154 |
| 15 | YEARS_BUILD_AVG | 66.497784 | 0.7552 | 0.752471 | 149 |
| 16 | OWN_CAR_AGE | 65.990810 | 9.0 | 12.061091 | 62 |
| 17 | LANDAREA_MEDI | 59.376738 | 0.0487 | 0.067169 | 3560 |
| 18 | LANDAREA_MODE | 59.376738 | 0.0458 | 0.064958 | 3563 |
| 19 | LANDAREA_AVG | 59.376738 | 0.0481 | 0.066333 | 3527 |
| 20 | BASEMENTAREA_MEDI | 58.515956 | 0.0758 | 0.087955 | 3772 |
| 21 | BASEMENTAREA_AVG | 58.515956 | 0.0763 | 0.088442 | 3780 |
| 22 | BASEMENTAREA_MODE | 58.515956 | 0.0746 | 0.087543 | 3841 |
| 23 | EXT_SOURCE_1 | 56.381073 | 0.505998 | 0.50213 | 114584 |
| 24 | NONLIVINGAREA_MODE | 55.179164 | 0.0011 | 0.027022 | 3327 |
| 25 | NONLIVINGAREA_AVG | 55.179164 | 0.0036 | 0.028358 | 3290 |
| 26 | NONLIVINGAREA_MEDI | 55.179164 | 0.0031 | 0.028236 | 3323 |
| 27 | ELEVATORS_MEDI | 53.295980 | 0.0 | 0.078078 | 46 |
| 28 | ELEVATORS_AVG | 53.295980 | 0.0 | 0.078942 | 257 |
| 29 | ELEVATORS_MODE | 53.295980 | 0.0 | 0.07449 | 26 |
| 30 | WALLSMATERIAL_MODE | 50.840783 | 0 Panel dtype: object | NA | 7 |
| 31 | APARTMENTS_MEDI | 50.749729 | 0.0864 | 0.11785 | 1148 |
| 32 | APARTMENTS_AVG | 50.749729 | 0.0876 | 0.11744 | 2339 |
| 33 | APARTMENTS_MODE | 50.749729 | 0.084 | 0.114231 | 760 |
| 34 | ENTRANCES_MEDI | 50.348768 | 0.1379 | 0.149213 | 46 |
| 35 | ENTRANCES_AVG | 50.348768 | 0.1379 | 0.149725 | 285 |
| 36 | ENTRANCES_MODE | 50.348768 | 0.1379 | 0.145193 | 30 |
| 37 | LIVINGAREA_AVG | 50.193326 | 0.0745 | 0.107399 | 5199 |
| 38 | LIVINGAREA_MODE | 50.193326 | 0.0731 | 0.105975 | 5301 |
| 39 | LIVINGAREA_MEDI | 50.193326 | 0.0749 | 0.108607 | 5281 |
| 40 | HOUSETYPE_MODE | 50.176091 | 0 block of flats dtype: object | NA | 3 |
| 41 | FLOORSMAX_MODE | 49.760822 | 0.1667 | 0.222315 | 25 |
| 42 | FLOORSMAX_MEDI | 49.760822 | 0.1667 | 0.225897 | 49 |
| 43 | FLOORSMAX_AVG | 49.760822 | 0.1667 | 0.226282 | 403 |
| 44 | YEARS_BEGINEXPLUATATION_MODE | 48.781019 | 0.9816 | 0.977065 | 221 |
| 45 | YEARS_BEGINEXPLUATATION_MEDI | 48.781019 | 0.9816 | 0.977752 | 245 |
| 46 | YEARS_BEGINEXPLUATATION_AVG | 48.781019 | 0.9816 | 0.977735 | 285 |
| 47 | TOTALAREA_MODE | 48.268517 | 0.0688 | 0.102547 | 5116 |
| 48 | EMERGENCYSTATE_MODE | 47.398304 | 0 No dtype: object | NA | 2 |
| 49 | OCCUPATION_TYPE | 31.345545 | 0 Laborers dtype: object | NA | 18 |
Analysis of training data
| col_name | percentage_missing | median/Mode | mean | no_of_unique_values | |
|---|---|---|---|---|---|
| 0 | COMMONAREA_MEDI | 69.872297 | 0.0208 | 0.044595 | 3202 |
| 1 | COMMONAREA_AVG | 69.872297 | 0.0211 | 0.044621 | 3181 |
| 2 | COMMONAREA_MODE | 69.872297 | 0.019 | 0.042553 | 3128 |
| 3 | NONLIVINGAPARTMENTS_MODE | 69.432963 | 0.0 | 0.008076 | 167 |
| 4 | NONLIVINGAPARTMENTS_AVG | 69.432963 | 0.0 | 0.008809 | 386 |
| ... | ... | ... | ... | ... | ... |
| 62 | EXT_SOURCE_2 | 0.214626 | 0.565961 | 0.514393 | 119831 |
| 63 | AMT_GOODS_PRICE | 0.090403 | 450000.0 | 538396.207429 | 1002 |
| 64 | AMT_ANNUITY | 0.003902 | 24903.0 | 27108.573909 | 13672 |
| 65 | CNT_FAM_MEMBERS | 0.000650 | 2.0 | 2.152665 | 17 |
| 66 | DAYS_LAST_PHONE_CHANGE | 0.000325 | -757.0 | -962.858788 | 3773 |
67 rows × 5 columns
plt.figure(figsize=[18, 25])
plt.title('A line plot showing the percentage missing of each column')
plt.grid(color='green')
plt.stackplot(df_analysis["percentage_missing"], df_analysis["col_name"], alpha=0.5)
[<matplotlib.collections.PolyCollection at 0x7fe2ad031d90>]
plt.figure(figsize=[40, 15])
sns.barplot(data=df_analysis, x=df_analysis['col_name'], y= df_analysis['percentage_missing'])
plt.xticks(rotation=70, fontsize=25)
plt.tight_layout()
plt.xlabel("Features", fontdict={'fontsize':40})
plt.ylabel("Missinng values (%)", fontdict={'fontsize':40})
Text(322.5, 0.5, 'Missinng values (%)')
df_zero = pd.DataFrame()
columns = []
percentage =[]
for col in df_train.columns:
if col == 'TARGET':
continue
count = (df_train[col] == 0).sum()
columns.append(col)
percentage.append(count/len(df_train[col]))
df_zero['Column'] = columns
df_zero['Percentage'] = percentage
per = 85/100
df_zero = df_zero[df_zero['Percentage']>per]
more_than_85 = df_zero
more_than_85
| Column | Percentage | |
|---|---|---|
| 26 | FLAG_EMAIL | 0.943280 |
| 33 | REG_REGION_NOT_LIVE_REGION | 0.984856 |
| 34 | REG_REGION_NOT_WORK_REGION | 0.949231 |
| 35 | LIVE_REGION_NOT_WORK_REGION | 0.959341 |
| 36 | REG_CITY_NOT_LIVE_CITY | 0.921827 |
| 91 | DEF_30_CNT_SOCIAL_CIRCLE | 0.882323 |
| 93 | DEF_60_CNT_SOCIAL_CIRCLE | 0.912881 |
| 95 | FLAG_DOCUMENT_2 | 0.999958 |
| 97 | FLAG_DOCUMENT_4 | 0.999919 |
| 98 | FLAG_DOCUMENT_5 | 0.984885 |
| 99 | FLAG_DOCUMENT_6 | 0.911945 |
| 100 | FLAG_DOCUMENT_7 | 0.999808 |
| 101 | FLAG_DOCUMENT_8 | 0.918624 |
| 102 | FLAG_DOCUMENT_9 | 0.996104 |
| 103 | FLAG_DOCUMENT_10 | 0.999977 |
| 104 | FLAG_DOCUMENT_11 | 0.996088 |
| 105 | FLAG_DOCUMENT_12 | 0.999993 |
| 106 | FLAG_DOCUMENT_13 | 0.996475 |
| 107 | FLAG_DOCUMENT_14 | 0.997064 |
| 108 | FLAG_DOCUMENT_15 | 0.998790 |
| 109 | FLAG_DOCUMENT_16 | 0.990072 |
| 110 | FLAG_DOCUMENT_17 | 0.999733 |
| 111 | FLAG_DOCUMENT_18 | 0.991870 |
| 112 | FLAG_DOCUMENT_19 | 0.999405 |
| 113 | FLAG_DOCUMENT_20 | 0.999493 |
| 114 | FLAG_DOCUMENT_21 | 0.999665 |
| 115 | AMT_REQ_CREDIT_BUREAU_HOUR | 0.859696 |
| 116 | AMT_REQ_CREDIT_BUREAU_DAY | 0.860142 |
df_train.drop(columns = more_than_85['Column'],inplace = True)
df_train =df_train[df_train['NAME_FAMILY_STATUS']!='Unknown']
df_train =df_train[df_train['CODE_GENDER']!='XNA']
df_train =df_train[df_train['NAME_INCOME_TYPE']!='Maternity leave']
df_numerical= df_train.select_dtypes(exclude='object')
df_numerical['TARGET'] = df_train['TARGET']
df_categorical= df_train.select_dtypes(include='object')
<ipython-input-17-0c726e26b403>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_numerical['TARGET'] = df_train['TARGET']
df_numerical.describe()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307500.000000 | 307500.000000 | 307500.000000 | 3.075000e+05 | 3.075000e+05 | 307488.000000 | 3.072240e+05 | 307500.000000 | 307500.000000 | 307500.000000 | ... | 137824.000000 | 159074.000000 | 306479.000000 | 306479.000000 | 307499.000000 | 307500.000000 | 265986.000000 | 265986.000000 | 265986.000000 | 265986.000000 |
| mean | 278181.087798 | 0.080725 | 0.417034 | 1.687971e+05 | 5.990259e+05 | 27108.477604 | 5.383943e+05 | 0.020868 | -16037.069246 | 63817.429333 | ... | 0.028237 | 0.102548 | 1.422202 | 1.405248 | -962.865681 | 0.710049 | 0.034363 | 0.267390 | 0.265476 | 1.899961 |
| std | 102789.822017 | 0.272413 | 0.722108 | 2.371263e+05 | 4.024936e+05 | 14493.600189 | 3.694459e+05 | 0.013831 | 4363.988872 | 141277.730537 | ... | 0.070168 | 0.107464 | 2.400947 | 2.379760 | 826.813694 | 0.453740 | 0.204687 | 0.915997 | 0.794062 | 1.869288 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -4292.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189146.750000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.041200 | 0.000000 | 0.000000 | -1570.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.500000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.003100 | 0.068800 | 0.000000 | 0.000000 | -757.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367143.250000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.026600 | 0.127600 | 2.000000 | 2.000000 | -274.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 348.000000 | 344.000000 | 0.000000 | 1.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 78 columns
correlation = df_numerical.corr()['TARGET'].sort_values(ascending = False).reset_index()
correlation.columns = ['col_name','Correlation']
after_correlation = correlation[abs(correlation['Correlation'])>0.03]
after_correlation
| col_name | Correlation | |
|---|---|---|
| 0 | TARGET | 1.000000 |
| 1 | DAYS_BIRTH | 0.078236 |
| 2 | REGION_RATING_CLIENT_W_CITY | 0.060875 |
| 3 | REGION_RATING_CLIENT | 0.058882 |
| 4 | DAYS_LAST_PHONE_CHANGE | 0.055228 |
| 5 | DAYS_ID_PUBLISH | 0.051455 |
| 6 | REG_CITY_NOT_WORK_CITY | 0.050981 |
| 7 | FLAG_EMP_PHONE | 0.045978 |
| 8 | FLAG_DOCUMENT_3 | 0.044371 |
| 9 | DAYS_REGISTRATION | 0.041950 |
| 10 | OWN_CAR_AGE | 0.037625 |
| 11 | LIVE_CITY_NOT_WORK_CITY | 0.032500 |
| 58 | AMT_CREDIT | -0.030390 |
| 59 | LIVINGAREA_MODE | -0.030688 |
| 60 | ELEVATORS_MODE | -0.032132 |
| 61 | TOTALAREA_MODE | -0.032600 |
| 62 | FLOORSMIN_MODE | -0.032700 |
| 63 | LIVINGAREA_MEDI | -0.032743 |
| 64 | LIVINGAREA_AVG | -0.033001 |
| 65 | FLOORSMIN_MEDI | -0.033397 |
| 66 | FLOORSMIN_AVG | -0.033616 |
| 67 | ELEVATORS_MEDI | -0.033864 |
| 68 | ELEVATORS_AVG | -0.034200 |
| 69 | REGION_POPULATION_RELATIVE | -0.037223 |
| 70 | AMT_GOODS_PRICE | -0.039671 |
| 71 | FLOORSMAX_MODE | -0.043228 |
| 72 | FLOORSMAX_MEDI | -0.043770 |
| 73 | FLOORSMAX_AVG | -0.044005 |
| 74 | DAYS_EMPLOYED | -0.044927 |
| 75 | EXT_SOURCE_1 | -0.155333 |
| 76 | EXT_SOURCE_2 | -0.160451 |
| 77 | EXT_SOURCE_3 | -0.178926 |
df_temp = missing(df_categorical,0)
df_temp
| col_name | percentage_missing | median/Mode | mean | no_of_unique_values | |
|---|---|---|---|---|---|
| 0 | FONDKAPREMONT_MODE | 68.386667 | 0 reg oper account dtype: object | NA | 4 |
| 1 | WALLSMATERIAL_MODE | 50.840976 | 0 Panel dtype: object | NA | 7 |
| 2 | HOUSETYPE_MODE | 50.176260 | 0 block of flats dtype: object | NA | 3 |
| 3 | EMERGENCYSTATE_MODE | 47.398374 | 0 No dtype: object | NA | 2 |
| 4 | OCCUPATION_TYPE | 31.345691 | 0 Laborers dtype: object | NA | 18 |
| 5 | NAME_TYPE_SUITE | 0.419512 | 0 Unaccompanied dtype: object | NA | 7 |
column_remove = ['FONDKAPREMONT_MODE','WALLSMATERIAL_MODE','HOUSETYPE_MODE','EMERGENCYSTATE_MODE','OCCUPATION_TYPE']
df_categorical.drop(columns = column_remove,inplace=True)
/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy return super().drop(
df_numerical.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 307500 entries, 0 to 307510 Data columns (total 78 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_CURR 307500 non-null int64 1 TARGET 307500 non-null int64 2 CNT_CHILDREN 307500 non-null int64 3 AMT_INCOME_TOTAL 307500 non-null float64 4 AMT_CREDIT 307500 non-null float64 5 AMT_ANNUITY 307488 non-null float64 6 AMT_GOODS_PRICE 307224 non-null float64 7 REGION_POPULATION_RELATIVE 307500 non-null float64 8 DAYS_BIRTH 307500 non-null int64 9 DAYS_EMPLOYED 307500 non-null int64 10 DAYS_REGISTRATION 307500 non-null float64 11 DAYS_ID_PUBLISH 307500 non-null int64 12 OWN_CAR_AGE 104579 non-null float64 13 FLAG_MOBIL 307500 non-null int64 14 FLAG_EMP_PHONE 307500 non-null int64 15 FLAG_WORK_PHONE 307500 non-null int64 16 FLAG_CONT_MOBILE 307500 non-null int64 17 FLAG_PHONE 307500 non-null int64 18 CNT_FAM_MEMBERS 307500 non-null float64 19 REGION_RATING_CLIENT 307500 non-null int64 20 REGION_RATING_CLIENT_W_CITY 307500 non-null int64 21 HOUR_APPR_PROCESS_START 307500 non-null int64 22 REG_CITY_NOT_WORK_CITY 307500 non-null int64 23 LIVE_CITY_NOT_WORK_CITY 307500 non-null int64 24 EXT_SOURCE_1 134126 non-null float64 25 EXT_SOURCE_2 306840 non-null float64 26 EXT_SOURCE_3 246541 non-null float64 27 APARTMENTS_AVG 151444 non-null float64 28 BASEMENTAREA_AVG 127562 non-null float64 29 YEARS_BEGINEXPLUATATION_AVG 157498 non-null float64 30 YEARS_BUILD_AVG 103019 non-null float64 31 COMMONAREA_AVG 92644 non-null float64 32 ELEVATORS_AVG 143614 non-null float64 33 ENTRANCES_AVG 152677 non-null float64 34 FLOORSMAX_AVG 154485 non-null float64 35 FLOORSMIN_AVG 98866 non-null float64 36 LANDAREA_AVG 124917 non-null float64 37 LIVINGAPARTMENTS_AVG 97309 non-null float64 38 LIVINGAREA_AVG 153155 non-null float64 39 NONLIVINGAPARTMENTS_AVG 93994 non-null float64 40 NONLIVINGAREA_AVG 137824 non-null float64 41 APARTMENTS_MODE 151444 non-null float64 42 BASEMENTAREA_MODE 127562 non-null float64 43 YEARS_BEGINEXPLUATATION_MODE 157498 non-null float64 44 YEARS_BUILD_MODE 103019 non-null float64 45 COMMONAREA_MODE 92644 non-null float64 46 ELEVATORS_MODE 143614 non-null float64 47 ENTRANCES_MODE 152677 non-null float64 48 FLOORSMAX_MODE 154485 non-null float64 49 FLOORSMIN_MODE 98866 non-null float64 50 LANDAREA_MODE 124917 non-null float64 51 LIVINGAPARTMENTS_MODE 97309 non-null float64 52 LIVINGAREA_MODE 153155 non-null float64 53 NONLIVINGAPARTMENTS_MODE 93994 non-null float64 54 NONLIVINGAREA_MODE 137824 non-null float64 55 APARTMENTS_MEDI 151444 non-null float64 56 BASEMENTAREA_MEDI 127562 non-null float64 57 YEARS_BEGINEXPLUATATION_MEDI 157498 non-null float64 58 YEARS_BUILD_MEDI 103019 non-null float64 59 COMMONAREA_MEDI 92644 non-null float64 60 ELEVATORS_MEDI 143614 non-null float64 61 ENTRANCES_MEDI 152677 non-null float64 62 FLOORSMAX_MEDI 154485 non-null float64 63 FLOORSMIN_MEDI 98866 non-null float64 64 LANDAREA_MEDI 124917 non-null float64 65 LIVINGAPARTMENTS_MEDI 97309 non-null float64 66 LIVINGAREA_MEDI 153155 non-null float64 67 NONLIVINGAPARTMENTS_MEDI 93994 non-null float64 68 NONLIVINGAREA_MEDI 137824 non-null float64 69 TOTALAREA_MODE 159074 non-null float64 70 OBS_30_CNT_SOCIAL_CIRCLE 306479 non-null float64 71 OBS_60_CNT_SOCIAL_CIRCLE 306479 non-null float64 72 DAYS_LAST_PHONE_CHANGE 307499 non-null float64 73 FLAG_DOCUMENT_3 307500 non-null int64 74 AMT_REQ_CREDIT_BUREAU_WEEK 265986 non-null float64 75 AMT_REQ_CREDIT_BUREAU_MON 265986 non-null float64 76 AMT_REQ_CREDIT_BUREAU_QRT 265986 non-null float64 77 AMT_REQ_CREDIT_BUREAU_YEAR 265986 non-null float64 dtypes: float64(61), int64(17) memory usage: 185.3 MB
df_train
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | Stone, brick | No | 2.0 | 2.0 | -1134.0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | Block | No | 1.0 | 1.0 | -828.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | NaN | NaN | 0.0 | 0.0 | -815.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | 2.0 | 2.0 | -617.0 | 1 | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | NaN | NaN | 0.0 | 0.0 | -1106.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | Cash loans | M | N | N | 0 | 157500.0 | 254700.0 | 27558.0 | ... | Stone, brick | No | 0.0 | 0.0 | -273.0 | 0 | NaN | NaN | NaN | NaN |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.0 | 269550.0 | 12001.5 | ... | Stone, brick | No | 0.0 | 0.0 | 0.0 | 1 | NaN | NaN | NaN | NaN |
| 307508 | 456253 | 0 | Cash loans | F | N | Y | 0 | 153000.0 | 677664.0 | 29979.0 | ... | Panel | No | 6.0 | 6.0 | -1909.0 | 1 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 456254 | 1 | Cash loans | F | N | Y | 0 | 171000.0 | 370107.0 | 20205.0 | ... | Stone, brick | No | 0.0 | 0.0 | -322.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 456255 | 0 | Cash loans | F | N | N | 0 | 157500.0 | 675000.0 | 49117.5 | ... | Panel | No | 0.0 | 0.0 | -787.0 | 1 | 0.0 | 2.0 | 0.0 | 1.0 |
307500 rows × 94 columns
df_temp = df_train[:1000]
df_temp
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | Stone, brick | No | 2.0 | 2.0 | -1134.0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | Block | No | 1.0 | 1.0 | -828.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | NaN | NaN | 0.0 | 0.0 | -815.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | 2.0 | 2.0 | -617.0 | 1 | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | NaN | NaN | 0.0 | 0.0 | -1106.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 101152 | 0 | Cash loans | F | N | N | 0 | 112500.0 | 495985.5 | 17946.0 | ... | Wooden | No | 4.0 | 4.0 | -1912.0 | 0 | 0.0 | 0.0 | 0.0 | 5.0 |
| 996 | 101153 | 0 | Cash loans | F | N | Y | 0 | 225000.0 | 1113840.0 | 57001.5 | ... | Wooden | No | 0.0 | 0.0 | -536.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 997 | 101154 | 0 | Cash loans | F | Y | Y | 0 | 144000.0 | 517536.0 | 28206.0 | ... | NaN | NaN | 0.0 | 0.0 | -2340.0 | 1 | 0.0 | 0.0 | 0.0 | 3.0 |
| 998 | 101155 | 0 | Cash loans | M | N | Y | 0 | 315000.0 | 1288350.0 | 37800.0 | ... | Panel | No | 0.0 | 0.0 | -631.0 | 1 | 0.0 | 1.0 | 1.0 | 1.0 |
| 999 | 101156 | 0 | Cash loans | M | Y | Y | 2 | 180000.0 | 679500.0 | 27076.5 | ... | NaN | NaN | 0.0 | 0.0 | -1743.0 | 1 | 0.0 | 0.0 | 0.0 | 3.0 |
1000 rows × 94 columns
df_temp['CODE_GENDER'] = df_temp['CODE_GENDER'].replace(['M'], [0])
df_temp['CODE_GENDER'] = df_temp['CODE_GENDER'].replace(['F'], [1])
df_temp
<ipython-input-25-1788b0763298>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_temp['CODE_GENDER'] = df_temp['CODE_GENDER'].replace(['M'], [0]) <ipython-input-25-1788b0763298>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_temp['CODE_GENDER'] = df_temp['CODE_GENDER'].replace(['F'], [1])
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | 0 | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | Stone, brick | No | 2.0 | 2.0 | -1134.0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | 1 | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | Block | No | 1.0 | 1.0 | -828.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | 0 | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | NaN | NaN | 0.0 | 0.0 | -815.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | 1 | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | 2.0 | 2.0 | -617.0 | 1 | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | 0 | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | NaN | NaN | 0.0 | 0.0 | -1106.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 101152 | 0 | Cash loans | 1 | N | N | 0 | 112500.0 | 495985.5 | 17946.0 | ... | Wooden | No | 4.0 | 4.0 | -1912.0 | 0 | 0.0 | 0.0 | 0.0 | 5.0 |
| 996 | 101153 | 0 | Cash loans | 1 | N | Y | 0 | 225000.0 | 1113840.0 | 57001.5 | ... | Wooden | No | 0.0 | 0.0 | -536.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 997 | 101154 | 0 | Cash loans | 1 | Y | Y | 0 | 144000.0 | 517536.0 | 28206.0 | ... | NaN | NaN | 0.0 | 0.0 | -2340.0 | 1 | 0.0 | 0.0 | 0.0 | 3.0 |
| 998 | 101155 | 0 | Cash loans | 0 | N | Y | 0 | 315000.0 | 1288350.0 | 37800.0 | ... | Panel | No | 0.0 | 0.0 | -631.0 | 1 | 0.0 | 1.0 | 1.0 | 1.0 |
| 999 | 101156 | 0 | Cash loans | 0 | Y | Y | 2 | 180000.0 | 679500.0 | 27076.5 | ... | NaN | NaN | 0.0 | 0.0 | -1743.0 | 1 | 0.0 | 0.0 | 0.0 | 3.0 |
1000 rows × 94 columns
plt.figure(figsize=[18,10])
sns.swarmplot(x= 'CODE_GENDER', y= 'TARGET', data= df_temp, cmap="mako")
/usr/local/lib/python3.8/dist-packages/seaborn/categorical.py:1296: UserWarning: 68.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning) /usr/local/lib/python3.8/dist-packages/seaborn/categorical.py:1296: UserWarning: 82.5% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot. warnings.warn(msg, UserWarning)
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a878be50>
tr_corr = df_train.corr()
plt.figure(figsize=[20, 20])
sns.dark_palette("#69d", reverse=True, as_cmap=True)
sns.heatmap(tr_corr, cmap="crest", square=True, linewidth=0.05, center= 0, cbar_kws={"shrink": .5}, linecolor= 'black')
plt.title('Correlation Matrix of Train Dataset', pad= 20, fontdict= {'fontsize' : 25, 'fontweight' : 50})
Text(0.5, 1.0, 'Correlation Matrix of Train Dataset')
df_test.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
print("Most positively correlated attributes:\n", tr_corr.tail(10))
print("\n\n--------------------------------------------------------------\n\n")
print("\nMost negatively correlated attributes:\n", tr_corr.head(10))
Most positively correlated attributes:
SK_ID_CURR TARGET CNT_CHILDREN \
NONLIVINGAREA_MEDI 0.002434 -0.013340 0.000070
TOTALAREA_MODE 0.002285 -0.032600 -0.008013
OBS_30_CNT_SOCIAL_CIRCLE -0.001419 0.009116 0.015590
OBS_60_CNT_SOCIAL_CIRCLE -0.001434 0.009007 0.015229
DAYS_LAST_PHONE_CHANGE -0.000845 0.055228 -0.005875
FLAG_DOCUMENT_3 -0.003420 0.044371 0.056880
AMT_REQ_CREDIT_BUREAU_WEEK 0.002098 0.000787 -0.002435
AMT_REQ_CREDIT_BUREAU_MON 0.000500 -0.012461 -0.010833
AMT_REQ_CREDIT_BUREAU_QRT 0.001022 -0.002023 -0.007831
AMT_REQ_CREDIT_BUREAU_YEAR 0.004668 0.019932 -0.041556
AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY \
NONLIVINGAREA_MEDI 0.070850 0.035829 0.047018
TOTALAREA_MODE 0.041983 0.072813 0.090404
OBS_30_CNT_SOCIAL_CIRCLE -0.013100 0.000190 -0.011978
OBS_60_CNT_SOCIAL_CIRCLE -0.013017 0.000239 -0.011722
DAYS_LAST_PHONE_CHANGE -0.018586 -0.073712 -0.063761
FLAG_DOCUMENT_3 -0.016747 0.096370 0.102861
AMT_REQ_CREDIT_BUREAU_WEEK 0.002387 -0.001277 0.013879
AMT_REQ_CREDIT_BUREAU_MON 0.024701 0.054460 0.039158
AMT_REQ_CREDIT_BUREAU_QRT 0.004858 0.015923 0.010120
AMT_REQ_CREDIT_BUREAU_YEAR 0.011687 -0.048447 -0.011319
AMT_GOODS_PRICE REGION_POPULATION_RELATIVE \
NONLIVINGAREA_MEDI 0.039877 0.066070
TOTALAREA_MODE 0.077527 0.202154
OBS_30_CNT_SOCIAL_CIRCLE 0.000319 -0.011175
OBS_60_CNT_SOCIAL_CIRCLE 0.000338 -0.010654
DAYS_LAST_PHONE_CHANGE -0.076322 -0.044009
FLAG_DOCUMENT_3 0.074912 -0.084630
AMT_REQ_CREDIT_BUREAU_WEEK -0.001008 -0.002147
AMT_REQ_CREDIT_BUREAU_MON 0.056432 0.078606
AMT_REQ_CREDIT_BUREAU_QRT 0.016426 -0.001282
AMT_REQ_CREDIT_BUREAU_YEAR -0.051005 0.000986
DAYS_BIRTH DAYS_EMPLOYED ... \
NONLIVINGAREA_MEDI 0.006328 -0.012982 ...
TOTALAREA_MODE 0.001349 -0.015133 ...
OBS_30_CNT_SOCIAL_CIRCLE 0.007393 0.005743 ...
OBS_60_CNT_SOCIAL_CIRCLE 0.006990 0.005896 ...
DAYS_LAST_PHONE_CHANGE 0.082944 0.023037 ...
FLAG_DOCUMENT_3 0.109700 -0.249120 ...
AMT_REQ_CREDIT_BUREAU_WEEK -0.001333 0.003070 ...
AMT_REQ_CREDIT_BUREAU_MON 0.001370 -0.034455 ...
AMT_REQ_CREDIT_BUREAU_QRT -0.011801 0.015343 ...
AMT_REQ_CREDIT_BUREAU_YEAR -0.071997 0.049992 ...
NONLIVINGAREA_MEDI TOTALAREA_MODE \
NONLIVINGAREA_MEDI 1.000000 0.361625
TOTALAREA_MODE 0.361625 1.000000
OBS_30_CNT_SOCIAL_CIRCLE -0.014429 -0.025587
OBS_60_CNT_SOCIAL_CIRCLE -0.014310 -0.025230
DAYS_LAST_PHONE_CHANGE -0.001909 -0.008279
FLAG_DOCUMENT_3 -0.027622 -0.052919
AMT_REQ_CREDIT_BUREAU_WEEK -0.006232 -0.000870
AMT_REQ_CREDIT_BUREAU_MON 0.012896 0.034814
AMT_REQ_CREDIT_BUREAU_QRT -0.002848 -0.003489
AMT_REQ_CREDIT_BUREAU_YEAR -0.008826 -0.018927
OBS_30_CNT_SOCIAL_CIRCLE \
NONLIVINGAREA_MEDI -0.014429
TOTALAREA_MODE -0.025587
OBS_30_CNT_SOCIAL_CIRCLE 1.000000
OBS_60_CNT_SOCIAL_CIRCLE 0.998489
DAYS_LAST_PHONE_CHANGE -0.014769
FLAG_DOCUMENT_3 0.026208
AMT_REQ_CREDIT_BUREAU_WEEK 0.000084
AMT_REQ_CREDIT_BUREAU_MON 0.001575
AMT_REQ_CREDIT_BUREAU_QRT 0.004129
AMT_REQ_CREDIT_BUREAU_YEAR 0.032534
OBS_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE \
NONLIVINGAREA_MEDI -0.014310 -0.001909
TOTALAREA_MODE -0.025230 -0.008279
OBS_30_CNT_SOCIAL_CIRCLE 0.998489 -0.014769
OBS_60_CNT_SOCIAL_CIRCLE 1.000000 -0.015185
DAYS_LAST_PHONE_CHANGE -0.015185 1.000000
FLAG_DOCUMENT_3 0.026254 -0.061750
AMT_REQ_CREDIT_BUREAU_WEEK 0.000200 -0.003494
AMT_REQ_CREDIT_BUREAU_MON 0.001651 -0.041251
AMT_REQ_CREDIT_BUREAU_QRT 0.003921 -0.001445
AMT_REQ_CREDIT_BUREAU_YEAR 0.032924 -0.112719
FLAG_DOCUMENT_3 AMT_REQ_CREDIT_BUREAU_WEEK \
NONLIVINGAREA_MEDI -0.027622 -0.006232
TOTALAREA_MODE -0.052919 -0.000870
OBS_30_CNT_SOCIAL_CIRCLE 0.026208 0.000084
OBS_60_CNT_SOCIAL_CIRCLE 0.026254 0.000200
DAYS_LAST_PHONE_CHANGE -0.061750 -0.003494
FLAG_DOCUMENT_3 1.000000 0.008170
AMT_REQ_CREDIT_BUREAU_WEEK 0.008170 1.000000
AMT_REQ_CREDIT_BUREAU_MON 0.010369 -0.014095
AMT_REQ_CREDIT_BUREAU_QRT 0.009572 -0.015115
AMT_REQ_CREDIT_BUREAU_YEAR 0.048228 0.018919
AMT_REQ_CREDIT_BUREAU_MON \
NONLIVINGAREA_MEDI 0.012896
TOTALAREA_MODE 0.034814
OBS_30_CNT_SOCIAL_CIRCLE 0.001575
OBS_60_CNT_SOCIAL_CIRCLE 0.001651
DAYS_LAST_PHONE_CHANGE -0.041251
FLAG_DOCUMENT_3 0.010369
AMT_REQ_CREDIT_BUREAU_WEEK -0.014095
AMT_REQ_CREDIT_BUREAU_MON 1.000000
AMT_REQ_CREDIT_BUREAU_QRT -0.007785
AMT_REQ_CREDIT_BUREAU_YEAR -0.004986
AMT_REQ_CREDIT_BUREAU_QRT \
NONLIVINGAREA_MEDI -0.002848
TOTALAREA_MODE -0.003489
OBS_30_CNT_SOCIAL_CIRCLE 0.004129
OBS_60_CNT_SOCIAL_CIRCLE 0.003921
DAYS_LAST_PHONE_CHANGE -0.001445
FLAG_DOCUMENT_3 0.009572
AMT_REQ_CREDIT_BUREAU_WEEK -0.015115
AMT_REQ_CREDIT_BUREAU_MON -0.007785
AMT_REQ_CREDIT_BUREAU_QRT 1.000000
AMT_REQ_CREDIT_BUREAU_YEAR 0.076201
AMT_REQ_CREDIT_BUREAU_YEAR
NONLIVINGAREA_MEDI -0.008826
TOTALAREA_MODE -0.018927
OBS_30_CNT_SOCIAL_CIRCLE 0.032534
OBS_60_CNT_SOCIAL_CIRCLE 0.032924
DAYS_LAST_PHONE_CHANGE -0.112719
FLAG_DOCUMENT_3 0.048228
AMT_REQ_CREDIT_BUREAU_WEEK 0.018919
AMT_REQ_CREDIT_BUREAU_MON -0.004986
AMT_REQ_CREDIT_BUREAU_QRT 0.076201
AMT_REQ_CREDIT_BUREAU_YEAR 1.000000
[10 rows x 78 columns]
--------------------------------------------------------------
Most negatively correlated attributes:
SK_ID_CURR TARGET CNT_CHILDREN \
SK_ID_CURR 1.000000 -0.002137 -0.001140
TARGET -0.002137 1.000000 0.019143
CNT_CHILDREN -0.001140 0.019143 1.000000
AMT_INCOME_TOTAL -0.001808 -0.003970 0.012897
AMT_CREDIT -0.000346 -0.030390 0.002139
AMT_ANNUITY -0.000417 -0.012814 0.021386
AMT_GOODS_PRICE -0.000239 -0.039671 -0.001836
REGION_POPULATION_RELATIVE 0.000864 -0.037223 -0.025573
DAYS_BIRTH -0.001514 0.078236 0.330937
DAYS_EMPLOYED 0.001363 -0.044927 -0.239816
AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY \
SK_ID_CURR -0.001808 -0.000346 -0.000417
TARGET -0.003970 -0.030390 -0.012814
CNT_CHILDREN 0.012897 0.002139 0.021386
AMT_INCOME_TOTAL 1.000000 0.156873 0.191652
AMT_CREDIT 0.156873 1.000000 0.770147
AMT_ANNUITY 0.191652 0.770147 1.000000
AMT_GOODS_PRICE 0.159613 0.986971 0.775119
REGION_POPULATION_RELATIVE 0.074792 0.099745 0.118417
DAYS_BIRTH 0.027264 -0.055435 0.009464
DAYS_EMPLOYED -0.064222 -0.066839 -0.104332
AMT_GOODS_PRICE REGION_POPULATION_RELATIVE \
SK_ID_CURR -0.000239 0.000864
TARGET -0.039671 -0.037223
CNT_CHILDREN -0.001836 -0.025573
AMT_INCOME_TOTAL 0.159613 0.074792
AMT_CREDIT 0.986971 0.099745
AMT_ANNUITY 0.775119 0.118417
AMT_GOODS_PRICE 1.000000 0.103524
REGION_POPULATION_RELATIVE 0.103524 1.000000
DAYS_BIRTH -0.053445 -0.029578
DAYS_EMPLOYED -0.064841 -0.003974
DAYS_BIRTH DAYS_EMPLOYED ... \
SK_ID_CURR -0.001514 0.001363 ...
TARGET 0.078236 -0.044927 ...
CNT_CHILDREN 0.330937 -0.239816 ...
AMT_INCOME_TOTAL 0.027264 -0.064222 ...
AMT_CREDIT -0.055435 -0.066839 ...
AMT_ANNUITY 0.009464 -0.104332 ...
AMT_GOODS_PRICE -0.053445 -0.064841 ...
REGION_POPULATION_RELATIVE -0.029578 -0.003974 ...
DAYS_BIRTH 1.000000 -0.615870 ...
DAYS_EMPLOYED -0.615870 1.000000 ...
NONLIVINGAREA_MEDI TOTALAREA_MODE \
SK_ID_CURR 0.002434 0.002285
TARGET -0.013340 -0.032600
CNT_CHILDREN 0.000070 -0.008013
AMT_INCOME_TOTAL 0.070850 0.041983
AMT_CREDIT 0.035829 0.072813
AMT_ANNUITY 0.047018 0.090404
AMT_GOODS_PRICE 0.039877 0.077527
REGION_POPULATION_RELATIVE 0.066070 0.202154
DAYS_BIRTH 0.006328 0.001349
DAYS_EMPLOYED -0.012982 -0.015133
OBS_30_CNT_SOCIAL_CIRCLE \
SK_ID_CURR -0.001419
TARGET 0.009116
CNT_CHILDREN 0.015590
AMT_INCOME_TOTAL -0.013100
AMT_CREDIT 0.000190
AMT_ANNUITY -0.011978
AMT_GOODS_PRICE 0.000319
REGION_POPULATION_RELATIVE -0.011175
DAYS_BIRTH 0.007393
DAYS_EMPLOYED 0.005743
OBS_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE \
SK_ID_CURR -0.001434 -0.000845
TARGET 0.009007 0.055228
CNT_CHILDREN 0.015229 -0.005875
AMT_INCOME_TOTAL -0.013017 -0.018586
AMT_CREDIT 0.000239 -0.073712
AMT_ANNUITY -0.011722 -0.063761
AMT_GOODS_PRICE 0.000338 -0.076322
REGION_POPULATION_RELATIVE -0.010654 -0.044009
DAYS_BIRTH 0.006990 0.082944
DAYS_EMPLOYED 0.005896 0.023037
FLAG_DOCUMENT_3 AMT_REQ_CREDIT_BUREAU_WEEK \
SK_ID_CURR -0.003420 0.002098
TARGET 0.044371 0.000787
CNT_CHILDREN 0.056880 -0.002435
AMT_INCOME_TOTAL -0.016747 0.002387
AMT_CREDIT 0.096370 -0.001277
AMT_ANNUITY 0.102861 0.013879
AMT_GOODS_PRICE 0.074912 -0.001008
REGION_POPULATION_RELATIVE -0.084630 -0.002147
DAYS_BIRTH 0.109700 -0.001333
DAYS_EMPLOYED -0.249120 0.003070
AMT_REQ_CREDIT_BUREAU_MON \
SK_ID_CURR 0.000500
TARGET -0.012461
CNT_CHILDREN -0.010833
AMT_INCOME_TOTAL 0.024701
AMT_CREDIT 0.054460
AMT_ANNUITY 0.039158
AMT_GOODS_PRICE 0.056432
REGION_POPULATION_RELATIVE 0.078606
DAYS_BIRTH 0.001370
DAYS_EMPLOYED -0.034455
AMT_REQ_CREDIT_BUREAU_QRT \
SK_ID_CURR 0.001022
TARGET -0.002023
CNT_CHILDREN -0.007831
AMT_INCOME_TOTAL 0.004858
AMT_CREDIT 0.015923
AMT_ANNUITY 0.010120
AMT_GOODS_PRICE 0.016426
REGION_POPULATION_RELATIVE -0.001282
DAYS_BIRTH -0.011801
DAYS_EMPLOYED 0.015343
AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR 0.004668
TARGET 0.019932
CNT_CHILDREN -0.041556
AMT_INCOME_TOTAL 0.011687
AMT_CREDIT -0.048447
AMT_ANNUITY -0.011319
AMT_GOODS_PRICE -0.051005
REGION_POPULATION_RELATIVE 0.000986
DAYS_BIRTH -0.071997
DAYS_EMPLOYED 0.049992
[10 rows x 78 columns]
ts_corr = df_test.corr()
plt.figure(figsize=[20, 20])
sns.dark_palette("#69d", reverse=True, as_cmap=True)
sns.heatmap(ts_corr, cmap="rocket", square=True, linewidth=0.05, center= 0, cbar_kws={"shrink": .5}, linecolor= 'black')
plt.title('Correlation Matrix of Test Dataset', pad= 20, fontdict= {'fontsize' : 25, 'fontweight' : 50})
Text(0.5, 1.0, 'Correlation Matrix of Test Dataset')
print("Most Positive Correlations:\n", ts_corr.tail(10))
print("\n\n-----------------------------------------------\n\n")
print("Most Negative Correlations:\n", ts_corr.head(10))
Most Positive Correlations:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL \
FLAG_DOCUMENT_18 -0.006286 -0.000862 -0.006624
FLAG_DOCUMENT_19 NaN NaN NaN
FLAG_DOCUMENT_20 NaN NaN NaN
FLAG_DOCUMENT_21 NaN NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR -0.000307 0.006362 0.010227
AMT_REQ_CREDIT_BUREAU_DAY 0.001083 0.001539 0.004989
AMT_REQ_CREDIT_BUREAU_WEEK 0.001178 0.007523 -0.002867
AMT_REQ_CREDIT_BUREAU_MON 0.000430 -0.008337 0.008691
AMT_REQ_CREDIT_BUREAU_QRT -0.002092 0.029006 0.007410
AMT_REQ_CREDIT_BUREAU_YEAR 0.003457 -0.039265 0.003281
AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE \
FLAG_DOCUMENT_18 -0.000197 -0.010762 0.004489
FLAG_DOCUMENT_19 NaN NaN NaN
FLAG_DOCUMENT_20 NaN NaN NaN
FLAG_DOCUMENT_21 NaN NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR -0.001092 0.008428 -0.000671
AMT_REQ_CREDIT_BUREAU_DAY 0.004882 0.006681 0.004865
AMT_REQ_CREDIT_BUREAU_WEEK 0.002904 0.003085 0.003358
AMT_REQ_CREDIT_BUREAU_MON -0.000156 0.005695 -0.000254
AMT_REQ_CREDIT_BUREAU_QRT -0.007750 0.012443 -0.008490
AMT_REQ_CREDIT_BUREAU_YEAR -0.034533 -0.044901 -0.036227
REGION_POPULATION_RELATIVE DAYS_BIRTH \
FLAG_DOCUMENT_18 0.013449 0.030752
FLAG_DOCUMENT_19 NaN NaN
FLAG_DOCUMENT_20 NaN NaN
FLAG_DOCUMENT_21 NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR -0.001408 0.008861
AMT_REQ_CREDIT_BUREAU_DAY -0.011773 -0.000386
AMT_REQ_CREDIT_BUREAU_WEEK -0.008321 0.012422
AMT_REQ_CREDIT_BUREAU_MON 0.000105 0.014094
AMT_REQ_CREDIT_BUREAU_QRT -0.026650 0.088752
AMT_REQ_CREDIT_BUREAU_YEAR 0.001015 -0.095551
DAYS_EMPLOYED DAYS_REGISTRATION ... \
FLAG_DOCUMENT_18 -0.018859 0.008590 ...
FLAG_DOCUMENT_19 NaN NaN ...
FLAG_DOCUMENT_20 NaN NaN ...
FLAG_DOCUMENT_21 NaN NaN ...
AMT_REQ_CREDIT_BUREAU_HOUR -0.012991 0.006212 ...
AMT_REQ_CREDIT_BUREAU_DAY -0.000785 -0.000152 ...
AMT_REQ_CREDIT_BUREAU_WEEK -0.014058 0.008692 ...
AMT_REQ_CREDIT_BUREAU_MON -0.013891 0.007414 ...
AMT_REQ_CREDIT_BUREAU_QRT -0.044351 0.046011 ...
AMT_REQ_CREDIT_BUREAU_YEAR 0.064698 -0.036887 ...
FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 \
FLAG_DOCUMENT_18 1.000000 NaN
FLAG_DOCUMENT_19 NaN NaN
FLAG_DOCUMENT_20 NaN NaN
FLAG_DOCUMENT_21 NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR -0.001761 NaN
AMT_REQ_CREDIT_BUREAU_DAY -0.001515 NaN
AMT_REQ_CREDIT_BUREAU_WEEK 0.009205 NaN
AMT_REQ_CREDIT_BUREAU_MON -0.003248 NaN
AMT_REQ_CREDIT_BUREAU_QRT -0.010480 NaN
AMT_REQ_CREDIT_BUREAU_YEAR -0.009864 NaN
FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
FLAG_DOCUMENT_18 NaN NaN
FLAG_DOCUMENT_19 NaN NaN
FLAG_DOCUMENT_20 NaN NaN
FLAG_DOCUMENT_21 NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR NaN NaN
AMT_REQ_CREDIT_BUREAU_DAY NaN NaN
AMT_REQ_CREDIT_BUREAU_WEEK NaN NaN
AMT_REQ_CREDIT_BUREAU_MON NaN NaN
AMT_REQ_CREDIT_BUREAU_QRT NaN NaN
AMT_REQ_CREDIT_BUREAU_YEAR NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR \
FLAG_DOCUMENT_18 -0.001761
FLAG_DOCUMENT_19 NaN
FLAG_DOCUMENT_20 NaN
FLAG_DOCUMENT_21 NaN
AMT_REQ_CREDIT_BUREAU_HOUR 1.000000
AMT_REQ_CREDIT_BUREAU_DAY 0.151506
AMT_REQ_CREDIT_BUREAU_WEEK -0.002345
AMT_REQ_CREDIT_BUREAU_MON 0.023510
AMT_REQ_CREDIT_BUREAU_QRT -0.003075
AMT_REQ_CREDIT_BUREAU_YEAR 0.011938
AMT_REQ_CREDIT_BUREAU_DAY \
FLAG_DOCUMENT_18 -0.001515
FLAG_DOCUMENT_19 NaN
FLAG_DOCUMENT_20 NaN
FLAG_DOCUMENT_21 NaN
AMT_REQ_CREDIT_BUREAU_HOUR 0.151506
AMT_REQ_CREDIT_BUREAU_DAY 1.000000
AMT_REQ_CREDIT_BUREAU_WEEK 0.035567
AMT_REQ_CREDIT_BUREAU_MON 0.005877
AMT_REQ_CREDIT_BUREAU_QRT 0.006509
AMT_REQ_CREDIT_BUREAU_YEAR 0.002002
AMT_REQ_CREDIT_BUREAU_WEEK \
FLAG_DOCUMENT_18 0.009205
FLAG_DOCUMENT_19 NaN
FLAG_DOCUMENT_20 NaN
FLAG_DOCUMENT_21 NaN
AMT_REQ_CREDIT_BUREAU_HOUR -0.002345
AMT_REQ_CREDIT_BUREAU_DAY 0.035567
AMT_REQ_CREDIT_BUREAU_WEEK 1.000000
AMT_REQ_CREDIT_BUREAU_MON 0.054291
AMT_REQ_CREDIT_BUREAU_QRT 0.024957
AMT_REQ_CREDIT_BUREAU_YEAR -0.000252
AMT_REQ_CREDIT_BUREAU_MON \
FLAG_DOCUMENT_18 -0.003248
FLAG_DOCUMENT_19 NaN
FLAG_DOCUMENT_20 NaN
FLAG_DOCUMENT_21 NaN
AMT_REQ_CREDIT_BUREAU_HOUR 0.023510
AMT_REQ_CREDIT_BUREAU_DAY 0.005877
AMT_REQ_CREDIT_BUREAU_WEEK 0.054291
AMT_REQ_CREDIT_BUREAU_MON 1.000000
AMT_REQ_CREDIT_BUREAU_QRT 0.005446
AMT_REQ_CREDIT_BUREAU_YEAR 0.026118
AMT_REQ_CREDIT_BUREAU_QRT \
FLAG_DOCUMENT_18 -0.010480
FLAG_DOCUMENT_19 NaN
FLAG_DOCUMENT_20 NaN
FLAG_DOCUMENT_21 NaN
AMT_REQ_CREDIT_BUREAU_HOUR -0.003075
AMT_REQ_CREDIT_BUREAU_DAY 0.006509
AMT_REQ_CREDIT_BUREAU_WEEK 0.024957
AMT_REQ_CREDIT_BUREAU_MON 0.005446
AMT_REQ_CREDIT_BUREAU_QRT 1.000000
AMT_REQ_CREDIT_BUREAU_YEAR -0.013081
AMT_REQ_CREDIT_BUREAU_YEAR
FLAG_DOCUMENT_18 -0.009864
FLAG_DOCUMENT_19 NaN
FLAG_DOCUMENT_20 NaN
FLAG_DOCUMENT_21 NaN
AMT_REQ_CREDIT_BUREAU_HOUR 0.011938
AMT_REQ_CREDIT_BUREAU_DAY 0.002002
AMT_REQ_CREDIT_BUREAU_WEEK -0.000252
AMT_REQ_CREDIT_BUREAU_MON 0.026118
AMT_REQ_CREDIT_BUREAU_QRT -0.013081
AMT_REQ_CREDIT_BUREAU_YEAR 1.000000
[10 rows x 105 columns]
-----------------------------------------------
Most Negative Correlations:
SK_ID_CURR CNT_CHILDREN AMT_INCOME_TOTAL \
SK_ID_CURR 1.000000 0.000635 0.001278
CNT_CHILDREN 0.000635 1.000000 0.038962
AMT_INCOME_TOTAL 0.001278 0.038962 1.000000
AMT_CREDIT 0.005014 0.027840 0.396572
AMT_ANNUITY 0.007112 0.056770 0.457833
AMT_GOODS_PRICE 0.005097 0.025507 0.401995
REGION_POPULATION_RELATIVE 0.003324 -0.015231 0.199773
DAYS_BIRTH 0.002325 0.317877 0.054400
DAYS_EMPLOYED -0.000845 -0.238319 -0.154619
DAYS_REGISTRATION 0.001032 0.175054 0.067973
AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE \
SK_ID_CURR 0.005014 0.007112 0.005097
CNT_CHILDREN 0.027840 0.056770 0.025507
AMT_INCOME_TOTAL 0.396572 0.457833 0.401995
AMT_CREDIT 1.000000 0.777733 0.988056
AMT_ANNUITY 0.777733 1.000000 0.787033
AMT_GOODS_PRICE 0.988056 0.787033 1.000000
REGION_POPULATION_RELATIVE 0.135694 0.150864 0.141453
DAYS_BIRTH -0.046169 0.047859 -0.039308
DAYS_EMPLOYED -0.083483 -0.137772 -0.086425
DAYS_REGISTRATION 0.030740 0.064450 0.033619
REGION_POPULATION_RELATIVE DAYS_BIRTH \
SK_ID_CURR 0.003324 0.002325
CNT_CHILDREN -0.015231 0.317877
AMT_INCOME_TOTAL 0.199773 0.054400
AMT_CREDIT 0.135694 -0.046169
AMT_ANNUITY 0.150864 0.047859
AMT_GOODS_PRICE 0.141453 -0.039308
REGION_POPULATION_RELATIVE 1.000000 -0.033339
DAYS_BIRTH -0.033339 1.000000
DAYS_EMPLOYED 0.000209 -0.637129
DAYS_REGISTRATION -0.053130 0.320174
DAYS_EMPLOYED DAYS_REGISTRATION ... \
SK_ID_CURR -0.000845 0.001032 ...
CNT_CHILDREN -0.238319 0.175054 ...
AMT_INCOME_TOTAL -0.154619 0.067973 ...
AMT_CREDIT -0.083483 0.030740 ...
AMT_ANNUITY -0.137772 0.064450 ...
AMT_GOODS_PRICE -0.086425 0.033619 ...
REGION_POPULATION_RELATIVE 0.000209 -0.053130 ...
DAYS_BIRTH -0.637129 0.320174 ...
DAYS_EMPLOYED 1.000000 -0.209123 ...
DAYS_REGISTRATION -0.209123 1.000000 ...
FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 \
SK_ID_CURR -0.006286 NaN
CNT_CHILDREN -0.000862 NaN
AMT_INCOME_TOTAL -0.006624 NaN
AMT_CREDIT -0.000197 NaN
AMT_ANNUITY -0.010762 NaN
AMT_GOODS_PRICE 0.004489 NaN
REGION_POPULATION_RELATIVE 0.013449 NaN
DAYS_BIRTH 0.030752 NaN
DAYS_EMPLOYED -0.018859 NaN
DAYS_REGISTRATION 0.008590 NaN
FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 \
SK_ID_CURR NaN NaN
CNT_CHILDREN NaN NaN
AMT_INCOME_TOTAL NaN NaN
AMT_CREDIT NaN NaN
AMT_ANNUITY NaN NaN
AMT_GOODS_PRICE NaN NaN
REGION_POPULATION_RELATIVE NaN NaN
DAYS_BIRTH NaN NaN
DAYS_EMPLOYED NaN NaN
DAYS_REGISTRATION NaN NaN
AMT_REQ_CREDIT_BUREAU_HOUR \
SK_ID_CURR -0.000307
CNT_CHILDREN 0.006362
AMT_INCOME_TOTAL 0.010227
AMT_CREDIT -0.001092
AMT_ANNUITY 0.008428
AMT_GOODS_PRICE -0.000671
REGION_POPULATION_RELATIVE -0.001408
DAYS_BIRTH 0.008861
DAYS_EMPLOYED -0.012991
DAYS_REGISTRATION 0.006212
AMT_REQ_CREDIT_BUREAU_DAY \
SK_ID_CURR 0.001083
CNT_CHILDREN 0.001539
AMT_INCOME_TOTAL 0.004989
AMT_CREDIT 0.004882
AMT_ANNUITY 0.006681
AMT_GOODS_PRICE 0.004865
REGION_POPULATION_RELATIVE -0.011773
DAYS_BIRTH -0.000386
DAYS_EMPLOYED -0.000785
DAYS_REGISTRATION -0.000152
AMT_REQ_CREDIT_BUREAU_WEEK \
SK_ID_CURR 0.001178
CNT_CHILDREN 0.007523
AMT_INCOME_TOTAL -0.002867
AMT_CREDIT 0.002904
AMT_ANNUITY 0.003085
AMT_GOODS_PRICE 0.003358
REGION_POPULATION_RELATIVE -0.008321
DAYS_BIRTH 0.012422
DAYS_EMPLOYED -0.014058
DAYS_REGISTRATION 0.008692
AMT_REQ_CREDIT_BUREAU_MON \
SK_ID_CURR 0.000430
CNT_CHILDREN -0.008337
AMT_INCOME_TOTAL 0.008691
AMT_CREDIT -0.000156
AMT_ANNUITY 0.005695
AMT_GOODS_PRICE -0.000254
REGION_POPULATION_RELATIVE 0.000105
DAYS_BIRTH 0.014094
DAYS_EMPLOYED -0.013891
DAYS_REGISTRATION 0.007414
AMT_REQ_CREDIT_BUREAU_QRT \
SK_ID_CURR -0.002092
CNT_CHILDREN 0.029006
AMT_INCOME_TOTAL 0.007410
AMT_CREDIT -0.007750
AMT_ANNUITY 0.012443
AMT_GOODS_PRICE -0.008490
REGION_POPULATION_RELATIVE -0.026650
DAYS_BIRTH 0.088752
DAYS_EMPLOYED -0.044351
DAYS_REGISTRATION 0.046011
AMT_REQ_CREDIT_BUREAU_YEAR
SK_ID_CURR 0.003457
CNT_CHILDREN -0.039265
AMT_INCOME_TOTAL 0.003281
AMT_CREDIT -0.034533
AMT_ANNUITY -0.044901
AMT_GOODS_PRICE -0.036227
REGION_POPULATION_RELATIVE 0.001015
DAYS_BIRTH -0.095551
DAYS_EMPLOYED 0.064698
DAYS_REGISTRATION -0.036887
[10 rows x 105 columns]
df_categorical.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 307500 entries, 0 to 307510 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 NAME_CONTRACT_TYPE 307500 non-null object 1 CODE_GENDER 307500 non-null object 2 FLAG_OWN_CAR 307500 non-null object 3 FLAG_OWN_REALTY 307500 non-null object 4 NAME_TYPE_SUITE 306210 non-null object 5 NAME_INCOME_TYPE 307500 non-null object 6 NAME_EDUCATION_TYPE 307500 non-null object 7 NAME_FAMILY_STATUS 307500 non-null object 8 NAME_HOUSING_TYPE 307500 non-null object 9 WEEKDAY_APPR_PROCESS_START 307500 non-null object 10 ORGANIZATION_TYPE 307500 non-null object dtypes: object(11) memory usage: 28.2+ MB
eda_cat1 = df_categorical['FLAG_OWN_REALTY'].value_counts()
print(eda_cat1)
plt.figure(figsize=[10,8])
sns.countplot(df_categorical['FLAG_OWN_REALTY'],palette = 'Reds')
plt.title("Percentage of loan in accordance to REALTY", fontweight = 'bold', fontsize = 14)
Y 213302 N 94198 Name: FLAG_OWN_REALTY, dtype: int64
/usr/local/lib/python3.8/dist-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
Text(0.5, 1.0, 'Percentage of loan in accordance to REALTY')
eda_cat2 = df_categorical['NAME_INCOME_TYPE'].value_counts()
print(eda_cat2)
plt.figure(figsize=[15,10])
sns.countplot(df_categorical['NAME_INCOME_TYPE'],palette = 'Blues')
plt.title("Percentage of loan in accordance to Income Type", fontweight = 'bold', fontsize = 20)
Working 158771 Commercial associate 71614 Pensioner 55362 State servant 21703 Unemployed 22 Student 18 Businessman 10 Name: NAME_INCOME_TYPE, dtype: int64
/usr/local/lib/python3.8/dist-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
Text(0.5, 1.0, 'Percentage of loan in accordance to Income Type')
eda_cat3 = df_categorical['NAME_CONTRACT_TYPE'].value_counts()
print(eda_cat3)
plt.figure(figsize=[10,8])
sns.countplot(df_categorical['NAME_CONTRACT_TYPE'],palette = 'Greens')
plt.title("Percentage of loan in accordance to Contract Type", fontweight = 'bold', fontsize = 16)
Cash loans 278230 Revolving loans 29270 Name: NAME_CONTRACT_TYPE, dtype: int64
/usr/local/lib/python3.8/dist-packages/seaborn/_decorators.py:36: FutureWarning: Pass the following variable as a keyword arg: x. From version 0.12, the only valid positional argument will be `data`, and passing other arguments without an explicit keyword will result in an error or misinterpretation. warnings.warn(
Text(0.5, 1.0, 'Percentage of loan in accordance to Contract Type')
df_bureau = pd.read_csv('bureau.csv')
print("No. of entriese in bureau datasets: " + str(df_bureau.shape[0]) + "\n")
df_bureau.head
No. of entriese in bureau datasets: 1716428
<bound method NDFrame.head of SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT \
0 215354 5714462 Closed currency 1 -497
1 215354 5714463 Active currency 1 -208
2 215354 5714464 Active currency 1 -203
3 215354 5714465 Active currency 1 -203
4 215354 5714466 Active currency 1 -629
... ... ... ... ... ...
1716423 259355 5057750 Active currency 1 -44
1716424 100044 5057754 Closed currency 1 -2648
1716425 100044 5057762 Closed currency 1 -1809
1716426 246829 5057770 Closed currency 1 -1878
1716427 246829 5057778 Closed currency 1 -463
CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT \
0 0 -153.0 -153.0
1 0 1075.0 NaN
2 0 528.0 NaN
3 0 NaN NaN
4 0 1197.0 NaN
... ... ... ...
1716423 0 -30.0 NaN
1716424 0 -2433.0 -2493.0
1716425 0 -1628.0 -970.0
1716426 0 -1513.0 -1513.0
1716427 0 NaN -387.0
AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM \
0 NaN 0 91323.00
1 NaN 0 225000.00
2 NaN 0 464323.50
3 NaN 0 90000.00
4 77674.5 0 2700000.00
... ... ... ...
1716423 0.0 0 11250.00
1716424 5476.5 0 38130.84
1716425 NaN 0 15570.00
1716426 NaN 0 36000.00
1716427 NaN 0 22500.00
AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE \
0 0.0 NaN 0.0
1 171342.0 NaN 0.0
2 NaN NaN 0.0
3 NaN NaN 0.0
4 NaN NaN 0.0
... ... ... ...
1716423 11250.0 0.0 0.0
1716424 0.0 0.0 0.0
1716425 NaN NaN 0.0
1716426 0.0 0.0 0.0
1716427 0.0 NaN 0.0
CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 Consumer credit -131 NaN
1 Credit card -20 NaN
2 Consumer credit -16 NaN
3 Credit card -16 NaN
4 Consumer credit -21 NaN
... ... ... ...
1716423 Microloan -19 NaN
1716424 Consumer credit -2493 NaN
1716425 Consumer credit -967 NaN
1716426 Consumer credit -1508 NaN
1716427 Microloan -387 NaN
[1716428 rows x 17 columns]>
df_bureau.describe()
| SK_ID_CURR | SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.716428e+06 | 1.610875e+06 | 1.082775e+06 | 5.919400e+05 | 1.716428e+06 | 1.716415e+06 | 1.458759e+06 | 1.124648e+06 | 1.716428e+06 | 1.716428e+06 | 4.896370e+05 |
| mean | 2.782149e+05 | 5.924434e+06 | -1.142108e+03 | 8.181666e-01 | 5.105174e+02 | -1.017437e+03 | 3.825418e+03 | 6.410406e-03 | 3.549946e+05 | 1.370851e+05 | 6.229515e+03 | 3.791276e+01 | -5.937483e+02 | 1.571276e+04 |
| std | 1.029386e+05 | 5.322657e+05 | 7.951649e+02 | 3.654443e+01 | 4.994220e+03 | 7.140106e+02 | 2.060316e+05 | 9.622391e-02 | 1.149811e+06 | 6.774011e+05 | 4.503203e+04 | 5.937650e+03 | 7.207473e+02 | 3.258269e+05 |
| min | 1.000010e+05 | 5.000000e+06 | -2.922000e+03 | 0.000000e+00 | -4.206000e+04 | -4.202300e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -4.705600e+06 | -5.864061e+05 | 0.000000e+00 | -4.194700e+04 | 0.000000e+00 |
| 25% | 1.888668e+05 | 5.463954e+06 | -1.666000e+03 | 0.000000e+00 | -1.138000e+03 | -1.489000e+03 | 0.000000e+00 | 0.000000e+00 | 5.130000e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -9.080000e+02 | 0.000000e+00 |
| 50% | 2.780550e+05 | 5.926304e+06 | -9.870000e+02 | 0.000000e+00 | -3.300000e+02 | -8.970000e+02 | 0.000000e+00 | 0.000000e+00 | 1.255185e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | -3.950000e+02 | 0.000000e+00 |
| 75% | 3.674260e+05 | 6.385681e+06 | -4.740000e+02 | 0.000000e+00 | 4.740000e+02 | -4.250000e+02 | 0.000000e+00 | 0.000000e+00 | 3.150000e+05 | 4.015350e+04 | 0.000000e+00 | 0.000000e+00 | -3.300000e+01 | 1.350000e+04 |
| max | 4.562550e+05 | 6.843457e+06 | 0.000000e+00 | 2.792000e+03 | 3.119900e+04 | 0.000000e+00 | 1.159872e+08 | 9.000000e+00 | 5.850000e+08 | 1.701000e+08 | 4.705600e+06 | 3.756681e+06 | 3.720000e+02 | 1.184534e+08 |
bureau_corr = df_bureau.corr()
plt.figure(figsize=[20, 20])
sns.dark_palette("#69d", reverse=True, as_cmap=True)
sns.heatmap(bureau_corr, cmap="YlGnBu", square=True, linewidth=0.05, center= 0, cbar_kws={"shrink": .5}, linecolor= 'black')
plt.title('Correlation Matrix of Bureau Dataset', pad= 20, fontdict= {'fontsize' : 25, 'fontweight' : 50})
Text(0.5, 1.0, 'Correlation Matrix of Bureau Dataset')
plt.figure(figsize=(15,10))
print("Mean days of overdue credit: ", np.mean(df_bureau['CREDIT_DAY_OVERDUE']))
print("Fewest days of overdue credit: ",np.min(df_bureau['CREDIT_DAY_OVERDUE']))
print("Most days of overdue credit: ",np.max(df_bureau['CREDIT_DAY_OVERDUE']))
plt.scatter(df_bureau['CREDIT_DAY_OVERDUE'], df_bureau['CREDIT_TYPE'])
plt.title('Credit Types according to Bureau data'); plt.xlabel('# of days overdue (incl. 0 days)'); plt.ylabel('');
plt.show()
Mean days of overdue credit: 0.8181665645165425 Fewest days of overdue credit: 0 Most days of overdue credit: 2792
df_topOverdue = df_bureau.loc[df_bureau['CREDIT_DAY_OVERDUE'] >0]
print(df_topOverdue['CREDIT_TYPE'].value_counts())
print('------------------')
print("Mean days of overdue credit: ", np.mean(df_topOverdue['CREDIT_DAY_OVERDUE']))
print("Fewest days of overdue credit: ",np.min(df_topOverdue['CREDIT_DAY_OVERDUE']))
print("Most days of overdue credit: ",np.max(df_topOverdue['CREDIT_DAY_OVERDUE']))
plt.figure(figsize=(15,10))
plt.scatter(df_topOverdue['CREDIT_DAY_OVERDUE'], df_topOverdue['CREDIT_TYPE'])
plt.title('Overdue credit trends by credit type'); plt.xlabel('# of days overdue (excl. 0 days)'); plt.ylabel('');
plt.show()
Consumer credit 2628 Credit card 1466 Mortgage 53 Car loan 52 Microloan 10 Loan for business development 4 Loan for working capital replenishment 2 Another type of loan 2 Name: CREDIT_TYPE, dtype: int64 ------------------ Mean days of overdue credit: 333.0149395304719 Fewest days of overdue credit: 1 Most days of overdue credit: 2792
Applied machine learning is typically focused on finding a single model that performs well or best on a given dataset.
Effective use of the model will require appropriate preparation of the input data and hyperparameter tuning of the model.
Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction).
The below pipeline is for the project:
!pip install latexify-py==0.2.0
Requirement already satisfied: latexify-py==0.2.0 in c:\users\sharm\anaconda3\lib\site-packages (0.2.0) Requirement already satisfied: dill>=0.3.2 in c:\users\sharm\anaconda3\lib\site-packages (from latexify-py==0.2.0) (0.3.6)
import math
import latexify
The foundation of Naive Bayes is the Bayes Theorem and the presumption of predictor independence. This approach is especially helpful with a dataset this size because it makes building a model simple and doesn't require tedious iterative parameter estimation.
Using Bayes theorem, we can find the probability of A happening, given that B has occurred. So given the previous data of the applicant we can predict the probability of the applicant being loan defaulter.
@latexify.function
def P(A,B):
return (P(B|A)*P(A)/P(B))
P
#Here X=x1,x2,x3...xN аre list оf indeрendent рrediсtоrs y is the class label P(y|X) is the probability of label y given the predictors X
@latexify.function
def P(y,X):
return (P(X|y)*P(X)/P(y))
P
A dichotomous result's likelihood is predicted by logistic regression. Its foundation is the use of one or more predictors. The logistic curve created by this technique can only contain values between 0 and 1. Logistic Regression is used when the dependent variable(target) is categorical.
The dependent variable in logistic regression follows Bernoulli Distribution. Estimation is done through maximum likelihood. No R Square, Model fitness is calculated through Concordance, KS-Statistics
@latexify.function(use_math_symbols=True)
def LR(y):
return beta * 0 + beta * 1 * X * 1 + (...) + beta *N * X * N
LR
#Where, y is dependent variable and x1, x2 ... and Xn are explanatory variables.
##Sigmoid Function
@latexify.function
def S(p):
return 1 / (1 + e ** -y)
S
In Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration.
Random forest is a Supervised Machine Learning Algorithm that is used widely in Classification and Regression problems. It builds decision trees on different samples and takes their majority vote for classification and average in case of regression. Random forest works on the Bagging principle. Bagging, also known as Bootstrap Aggregation is the ensemble technique used by random forest. Bagging chooses a random sample from the data set. Hence each model is generated from the samples (Bootstrap Samples) provided by the Original Data with replacement known as row sampling. This step of row sampling with replacement is called bootstrap. Now each model is trained independently which generates results. The final output is based on majority voting after combining the results of all models. This step which involves combining all the results and generating output based on majority voting is known as aggregation.
Steps involved in random forest algorithm:
In Random forest n number of random records are taken from the data set having k number of records.
Individual decision trees are constructed for each sample.
Each decision tree will generate an output.
Final output is considered based on Majority Voting or Averaging for Classification and regression respectively.
The results are evaluated on different metric on the models. After building the model we evaluate them on various metrics.
Log Loss: Logarithmic Loss or Log Loss, works by penalising the false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign probability to each class for all the samples.
@latexify.function
def LogarithmicLoss(y):
return -1/N * sum( y[i][j] * math.log(p[i][j]) for i in range(1, N+1) for j in range(1, M+1))
LogarithmicLoss
#We are getting N+1-1 which is a bug in the source code
Accuracy: Classification Accuracy is what we usually mean, when we use the term accuracy. It is the ratio of number of correct predictions to the total number of input samples. It works well only if there are equal number of samples belonging to each class.
@latexify.function
def Accuracy(a):
return (Number_Of_correct_predictions/ Total_number_of_predictions_made)
Accuracy
Confusion Matrix: Confusion Matrix is a tabular visualization of the ground-truth labels versus model predictions. Each row of the confusion matrix represents the instances in a predicted class and each column represents the instances in an actual class. Confusion Matrix is not exactly a performance metric but sort of a basis on which other metrics evaluate the results.
F1-Score: F1 Score is the Harmonic Mean between precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is, as well as how robust it is. High precision but lower recall, gives you an extremely accurate, but it then misses a large number of instances that are difficult to classify. The greater the F1 Score, the better is the performance of our model.
@latexify.function
def F1(a):
return (2 * (1/ ((1/Precision)+ (1/Recall))))
F1
ROC_AUC: The Receiver Operator Characteristic (ROC) curve is an evaluation metric for binary classification problems. It is a probability curve that plots the TPR against FPR at various threshold values and essentially separates the ‘signal’ from the ‘noise’. The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer
import warnings
warnings.filterwarnings('ignore')
x = df_numerical.drop(["TARGET","SK_ID_CURR"], axis=1)
y = df_train["TARGET"].to_frame()
print(x.shape, y.shape)
x.head()
(307500, 76) (307500, 1)
| CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | ... | NONLIVINGAREA_MEDI | TOTALAREA_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | -9461 | -637 | -3648.0 | -2120 | ... | 0.00 | 0.0149 | 2.0 | 2.0 | -1134.0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | -16765 | -1188 | -1186.0 | -291 | ... | 0.01 | 0.0714 | 1.0 | 1.0 | -828.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | -19046 | -225 | -4260.0 | -2531 | ... | NaN | NaN | 0.0 | 0.0 | -815.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | ... | NaN | NaN | 2.0 | 2.0 | -617.0 | 1 | NaN | NaN | NaN | NaN |
| 4 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | ... | NaN | NaN | 0.0 | 0.0 | -1106.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 76 columns
# Create class for feature selection in the form of df columns
class dfSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
# Identify object-dtype features
cat_feat_raw = list(x.columns[x.dtypes.values == "O"].values)
print([i for i in cat_feat_raw])
print(f"\nNumber of categorical features before manual addition of other preencoded features: {len(cat_feat_raw)}")
[] Number of categorical features before manual addition of other preencoded features: 0
# Manual identification of other categorical features already encoded
cat_feat_add = [
"FLAG_MOBIL", "FLAG_EMP_PHONE", "FLAG_WORK_PHONE", "FLAG_CONT_MOBILE",
"FLAG_PHONE", "REGION_RATING_CLIENT", "REGION_RATING_CLIENT_W_CITY",
"REG_CITY_NOT_WORK_CITY", "LIVE_CITY_NOT_WORK_CITY",
"FLAG_DOCUMENT_3"]
print(f"\nNumber of categorical features to be added manually: {len(cat_feat_add)}")
Number of categorical features to be added manually: 10
cat_feat = cat_feat_raw + cat_feat_add
print(f"Number of categorical features: {len(cat_feat)}")
Number of categorical features: 10
# Baseline pipeline for categorical features
cat_pipe = Pipeline([
('selector', dfSelector(cat_feat)),
('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))]) # ignore values from validation/test data that do NOT occur in training set
num_feat = x.drop(cat_feat, axis = 1).columns
print(f"Number of numerical features: {len(num_feat)}")
Number of numerical features: 66
# Check if all columns selected
x.shape[1] == len(num_feat) + len(cat_feat)
True
# Baseline pipeline for numerical features
num_pipe = Pipeline([
('selector', dfSelector(num_feat)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler()),
])
data_prep_pipe = FeatureUnion(transformer_list=[ ("num_pipeline", num_pipe), ("cat_pipeline", cat_pipe),])
# Divide training data into actual training and validation (out-of-sample proxy) data
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=123)
print(f"x train shape: {x_train.shape}")
print(f"x validation shape: {x_valid.shape}")
print(f"y train shape: {y_train.shape}")
print(f"y validation shape: {y_valid.shape}")
x train shape: (246000, 76) x validation shape: (61500, 76) y train shape: (246000, 1) y validation shape: (61500, 1)
#Making sure the datatypes are same
print(type(x_train))
print(type(x_valid))
print(type(y_train))
print(type(y_valid))
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
x_train.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 246000 entries, 43358 to 249351 Data columns (total 76 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CNT_CHILDREN 246000 non-null int64 1 AMT_INCOME_TOTAL 246000 non-null float64 2 AMT_CREDIT 246000 non-null float64 3 AMT_ANNUITY 245990 non-null float64 4 AMT_GOODS_PRICE 245784 non-null float64 5 REGION_POPULATION_RELATIVE 246000 non-null float64 6 DAYS_BIRTH 246000 non-null int64 7 DAYS_EMPLOYED 246000 non-null int64 8 DAYS_REGISTRATION 246000 non-null float64 9 DAYS_ID_PUBLISH 246000 non-null int64 10 OWN_CAR_AGE 83522 non-null float64 11 FLAG_MOBIL 246000 non-null int64 12 FLAG_EMP_PHONE 246000 non-null int64 13 FLAG_WORK_PHONE 246000 non-null int64 14 FLAG_CONT_MOBILE 246000 non-null int64 15 FLAG_PHONE 246000 non-null int64 16 CNT_FAM_MEMBERS 246000 non-null float64 17 REGION_RATING_CLIENT 246000 non-null int64 18 REGION_RATING_CLIENT_W_CITY 246000 non-null int64 19 HOUR_APPR_PROCESS_START 246000 non-null int64 20 REG_CITY_NOT_WORK_CITY 246000 non-null int64 21 LIVE_CITY_NOT_WORK_CITY 246000 non-null int64 22 EXT_SOURCE_1 107323 non-null float64 23 EXT_SOURCE_2 245465 non-null float64 24 EXT_SOURCE_3 197206 non-null float64 25 APARTMENTS_AVG 121147 non-null float64 26 BASEMENTAREA_AVG 102093 non-null float64 27 YEARS_BEGINEXPLUATATION_AVG 125929 non-null float64 28 YEARS_BUILD_AVG 82489 non-null float64 29 COMMONAREA_AVG 74262 non-null float64 30 ELEVATORS_AVG 114894 non-null float64 31 ENTRANCES_AVG 122082 non-null float64 32 FLOORSMAX_AVG 123543 non-null float64 33 FLOORSMIN_AVG 79217 non-null float64 34 LANDAREA_AVG 99965 non-null float64 35 LIVINGAPARTMENTS_AVG 77984 non-null float64 36 LIVINGAREA_AVG 122465 non-null float64 37 NONLIVINGAPARTMENTS_AVG 75352 non-null float64 38 NONLIVINGAREA_AVG 110291 non-null float64 39 APARTMENTS_MODE 121147 non-null float64 40 BASEMENTAREA_MODE 102093 non-null float64 41 YEARS_BEGINEXPLUATATION_MODE 125929 non-null float64 42 YEARS_BUILD_MODE 82489 non-null float64 43 COMMONAREA_MODE 74262 non-null float64 44 ELEVATORS_MODE 114894 non-null float64 45 ENTRANCES_MODE 122082 non-null float64 46 FLOORSMAX_MODE 123543 non-null float64 47 FLOORSMIN_MODE 79217 non-null float64 48 LANDAREA_MODE 99965 non-null float64 49 LIVINGAPARTMENTS_MODE 77984 non-null float64 50 LIVINGAREA_MODE 122465 non-null float64 51 NONLIVINGAPARTMENTS_MODE 75352 non-null float64 52 NONLIVINGAREA_MODE 110291 non-null float64 53 APARTMENTS_MEDI 121147 non-null float64 54 BASEMENTAREA_MEDI 102093 non-null float64 55 YEARS_BEGINEXPLUATATION_MEDI 125929 non-null float64 56 YEARS_BUILD_MEDI 82489 non-null float64 57 COMMONAREA_MEDI 74262 non-null float64 58 ELEVATORS_MEDI 114894 non-null float64 59 ENTRANCES_MEDI 122082 non-null float64 60 FLOORSMAX_MEDI 123543 non-null float64 61 FLOORSMIN_MEDI 79217 non-null float64 62 LANDAREA_MEDI 99965 non-null float64 63 LIVINGAPARTMENTS_MEDI 77984 non-null float64 64 LIVINGAREA_MEDI 122465 non-null float64 65 NONLIVINGAPARTMENTS_MEDI 75352 non-null float64 66 NONLIVINGAREA_MEDI 110291 non-null float64 67 TOTALAREA_MODE 127198 non-null float64 68 OBS_30_CNT_SOCIAL_CIRCLE 245177 non-null float64 69 OBS_60_CNT_SOCIAL_CIRCLE 245177 non-null float64 70 DAYS_LAST_PHONE_CHANGE 245999 non-null float64 71 FLAG_DOCUMENT_3 246000 non-null int64 72 AMT_REQ_CREDIT_BUREAU_WEEK 212866 non-null float64 73 AMT_REQ_CREDIT_BUREAU_MON 212866 non-null float64 74 AMT_REQ_CREDIT_BUREAU_QRT 212866 non-null float64 75 AMT_REQ_CREDIT_BUREAU_YEAR 212866 non-null float64 dtypes: float64(61), int64(15) memory usage: 144.5 MB
baseline_pipe = Pipeline([
("data_prep", data_prep_pipe),
("Log_Reg", LogisticRegression(penalty = "none"))]) # no regularization for baseline model
# Fit whole pipeline
print(type(x_train))
print(type(y_train))
baseline_model = baseline_pipe.fit(x_train, y_train)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import RocCurveDisplay
# Evaluate performance on training data (limited validity)
y_train_pred = baseline_model.predict(x_train)
print("Confusion matrix (training data)")
print(confusion_matrix(y_train_pred, y_train))
print("------------------")
print(f"Overall accuracy (training data): {np.round(accuracy_score(y_train, y_train_pred), 3)*100}%")
Confusion matrix (training data) [[226001 19732] [ 122 145]] ------------------ Overall accuracy (training data): 91.9%
# Evaluate Performance on validation data (pipeline prediction only transforms yet unseen data, hence no leakage)
y_valid_pred = baseline_model.predict(x_valid)
print("Confusion matrix (validation data)")
print(confusion_matrix(y_valid_pred, y_valid))
print("------------------")
print(f"Overall accuracy (validation data): {np.round(accuracy_score(y_valid, y_valid_pred), 3)*100}%")
print("------------------")
print(f"AUROC (validation data): {np.round(roc_auc_score(y_valid, baseline_model.predict_proba(x_valid)[:, 1]), 3)*100}%")
print()
RocCurveDisplay.from_estimator(baseline_model, x_valid, y_valid)
plt.show()
Confusion matrix (validation data) [[56514 4914] [ 40 32]] ------------------ Overall accuracy (validation data): 91.9% ------------------ AUROC (validation data): 73.5%
# Create df to be sequentially filled in later stages of the project
expLog = pd.DataFrame(columns=["model_name",
"Train Acc",
"Valid Acc",
"Train AUC",
"Valid AUC",
"Comment (optional)"])
# Fill in baseline model performance numbers
expLog.loc[len(expLog)] = ["Baseline model: Logistic regression"] + list(np.round(
[accuracy_score(y_train, baseline_model.predict(x_train)),
accuracy_score(y_valid, baseline_model.predict(x_valid)),
roc_auc_score(y_train, baseline_model.predict_proba(x_train)[:, 1]),
roc_auc_score(y_valid, baseline_model.predict_proba(x_valid)[:, 1])], 3)) + [
"No regularization, feature selection, model tuning etc."]
expLog
| model_name | Train Acc | Valid Acc | Train AUC | Valid AUC | Comment (optional) | |
|---|---|---|---|---|---|---|
| 0 | Baseline model: Logistic regression | 0.919 | 0.919 | 0.733 | 0.735 | No regularization, feature selection, model tu... |
feat_names = list(baseline_model.named_steps["data_prep"].transformer_list[0][1].named_steps['selector'].attribute_names)
cat_feat_names =list(baseline_model.named_steps["data_prep"].transformer_list[1][1].named_steps['selector'].attribute_names)
feat_names.extend(cat_feat_names)
coefs = baseline_model.named_steps["Log_Reg"].coef_.flatten()
#view features and coefs in a neat way
zipped = zip(feat_names, coefs)
df = pd.DataFrame(zipped, columns=["feature", "value"])
# Sort the features by the absolute value of their coefficient
df["abs_value"] = df["value"].apply(lambda x: abs(x))
df["colors"] = df["value"].apply(lambda x: "green" if x > 0 else "red")
df = df.sort_values("abs_value", ascending=False)
#view top 5 features...
df.head()
| feature | value | abs_value | colors | |
|---|---|---|---|---|
| 4 | AMT_GOODS_PRICE | -1.242445 | 1.242445 | red |
| 2 | AMT_CREDIT | 1.079424 | 1.079424 | green |
| 68 | FLAG_WORK_PHONE | -0.762423 | 0.762423 | red |
| 67 | FLAG_EMP_PHONE | -0.481760 | 0.481760 | red |
| 15 | EXT_SOURCE_3 | -0.463476 | 0.463476 | red |
We have performed Logistic regression without regularization and have got an Train accuracy and valid accuracy of ~92%. Based on the ROC Area Under Curve values for the Logistic regression model, the ROC value is 0.735, showing a large amount of True Positive values, indicating a good fit to the data.
x_test = df_test
y_test_pred_prob = baseline_model.predict_proba(x_test)[:,1].reshape(-1,1)
df_kaggle = np.concatenate((x_test.SK_ID_CURR.values.reshape(-1,1), y_test_pred_prob), axis=1)
df_kaggle = pd.DataFrame(df_kaggle, columns = ["SK_ID_CURR", "TARGET"])
df_kaggle = df_kaggle.astype({"SK_ID_CURR": int, "TARGET": float})
df_kaggle.head()
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.072920 |
| 1 | 100005 | 0.135900 |
| 2 | 100013 | 0.037275 |
| 3 | 100028 | 0.020184 |
| 4 | 100038 | 0.123954 |
df_kaggle.to_csv("baseline_kaggle_sumbsission.csv", index=False, header = 1)
from IPython.display import Image
Image(filename='/content/Kaggle_Submission1.png')
The following steps will allow us to accomplish feature selection, engineering, and model selection:
Engineer features - Through our EDA we have identified features that may perform better if aggreated or combined into a ratio, we will add these features to our datasets before feature selection.
Select features - Using a Random Forest Classifer we will identify the importance of features across all datasets. After identifying feature importance, the top 50 most important features across all datasets will be selected to be used in our models.
Merger selected features & Subset our single dataframe to increase hyperparameter tuning performance -- Our merged training dataframe will be subset to increase performance through the hyperparameter tuning step. As the nature of the dataset is imbalanced, we will take special consideration to maintaining the correct proportion of target categories.
Create pipelines to control the flow of data
Categorical Pipeline: missing data will be imputed with the most frequent value and features will be one hot encoded.
Numeric Pipeline: missing data will be imputed with the mean value and features will be scaled.
Use Grid Search CV explore models to find optimal parameters in random forest, stochastic GD.
Evaluate efficacy of our selected model -- After finding an optimal model, the entire training dataset will be fitted and the performance of our optimal model will be evaluated.
from sklearn.ensemble import RandomForestClassifier
df_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_3 | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | 0 | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | Stone, brick | No | 2.0 | 2.0 | -1134.0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | Block | No | 1.0 | 1.0 | -828.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | 0 | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | NaN | NaN | 0.0 | 0.0 | -815.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | NaN | NaN | 2.0 | 2.0 | -617.0 | 1 | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | 0 | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | NaN | NaN | 0.0 | 0.0 | -1106.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 94 columns
Below are the new features engineered which are amonf the top 50 features which were used for training.
df_train['INC_AGE'] = df_train['AMT_INCOME_TOTAL'] / df_train['DAYS_BIRTH'] #Spending potential of oneselve
df_train['KIDS_FAM'] = df_train['CNT_CHILDREN'] / df_train['CNT_FAM_MEMBERS'] #Children per family, projecting approx heuristic for childcare expenses
df_train['LOAN_TO_VALUE'] = df_train['AMT_CREDIT'] / df_train['AMT_GOODS_PRICE'] #Risk Indicator
df_train['LOAN_FRAC'] = df_train['AMT_ANNUITY'] / df_train['AMT_INCOME_TOTAL'] #Fraction of total income, required to repay the loan
df_test['INC_AGE'] = df_test['AMT_INCOME_TOTAL'] / df_test['DAYS_BIRTH'] #Spending potential of oneselve
df_test['KIDS_FAM'] = df_test['CNT_CHILDREN'] / df_test['CNT_FAM_MEMBERS'] #Children per family, projecting approx heuristic for childcare expenses
df_test['LOAN_TO_VALUE'] = df_test['AMT_CREDIT'] / df_test['AMT_GOODS_PRICE'] #Risk Indicator
df_test['LOAN_FRAC'] = df_test['AMT_ANNUITY'] / df_test['AMT_INCOME_TOTAL'] #Fraction of total income, required to repay the loan
#DEBUGGGG!!!!!
df_debug = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
df_debug1 = df_debug.B
df_debug2 = df_debug["B"]
df_debug
| A | B | C | D | |
|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
df_debug1
0 1 1 5 2 9 Name: B, dtype: int64
#df_debug2.columns = 'TARGET'
#df_debug2.index.name = 'TARGET'
df_debug2.values
array([1, 5, 9])
loan_def = df_train[df_train['TARGET']==1] #Defaulted
loan_pai = df_train[df_train['TARGET']==0] #!Defaulted
n = loan_def.shape[0] + loan_pai.shape[0]
loan_def_fr, loan_pai_fr = loan_def.shape[0]/n, loan_pai.shape[0]/n
print("% of default: {}%\n% of non-default: {}%".format(round(loan_def_fr*100, 2), round(loan_pai_fr*100,2)))
loan_def_sample, loan_pai_sample = loan_def.sample(n=int(10000*loan_def_fr)), loan_pai.sample(n=int(10000*loan_pai_fr))
loan_sample = pd.concat([loan_def_sample,loan_pai_sample])
% of default: 8.07% % of non-default: 91.93%
X = loan_sample.drop(["TARGET"], axis=1)
y = loan_sample.TARGET
if X.select_dtypes(include=[np.number]).any(axis=None):
num_data = X.select_dtypes(include=np.number)
num_data = num_data.fillna(num_data.mean())
else: num_data = X['SK_ID_CURR']
if X.select_dtypes(include=['object']).any(axis=None):
cat_list = X.select_dtypes(include=['object']).columns.tolist()
cat_list.append('SK_ID_CURR')
cat_data = X[cat_list]
cat_list.remove('SK_ID_CURR')
cat_data = cat_data.fillna(cat_data.mode().iloc[0])
cat_data = pd.get_dummies(cat_data, columns=cat_list)
else: cat_data = X['SK_ID_CURR']
X = pd.merge(num_data,cat_data,how='left', on=['SK_ID_CURR'])
ids = X['SK_ID_CURR'].to_list()
X = X.drop(['SK_ID_CURR'], axis=1)
y
277429 1
297260 1
287538 1
22819 1
231988 1
..
42117 0
1486 0
125537 0
52664 0
24932 0
Name: TARGET, Length: 9999, dtype: int64
del cat_data, num_data
X1, _, y1,_ = train_test_split(X,y, train_size = 0.1, stratify= y, random_state=1)
model = RandomForestClassifier()
model.fit(X1, y1)
importance = model.feature_importances_
importance_dic = {"Dataset":"application_train", "Feature":X.columns,"Importance":importance}
application_train_importance_df = pd.DataFrame(data=importance_dic)
print("FEATURE IMPORTANCE:\n",application_train_importance_df.sort_values(by=['Importance'], ascending=False))
application_train_importance_df.sort_values(by=['Importance'], ascending=False).head(25).plot(x="Feature", y="Importance", kind="bar", figsize=(15,10))
FEATURE IMPORTANCE:
Dataset Feature Importance
78 application_train LOAN_TO_VALUE 0.040363
23 application_train EXT_SOURCE_2 0.040164
24 application_train EXT_SOURCE_3 0.037977
8 application_train DAYS_REGISTRATION 0.035499
6 application_train DAYS_BIRTH 0.031352
.. ... ... ...
174 application_train ORGANIZATION_TYPE_Mobile 0.000000
93 application_train NAME_TYPE_SUITE_Other_B 0.000000
176 application_train ORGANIZATION_TYPE_Police 0.000000
178 application_train ORGANIZATION_TYPE_Realtor 0.000000
215 application_train EMERGENCYSTATE_MODE_Yes 0.000000
[216 rows x 3 columns]
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a6ce77f0>
df_cc = pd.read_csv('credit_card_balance.csv')
df_cc = pd.concat([df_cc, pd.get_dummies(df_cc['NAME_CONTRACT_STATUS'], prefix='NAME_CONTRACT_STATUS')], axis=1)
df_cc = df_cc.drop(['NAME_CONTRACT_STATUS'], axis = 1)
df_cc['AMT_CREDIT_RATIO'] = df_cc['AMT_BALANCE']/(df_cc['AMT_CREDIT_LIMIT_ACTUAL'])
df_cc['MIN_PAYMENT_RATIO'] = df_cc['AMT_INST_MIN_REGULARITY']/(df_cc['AMT_PAYMENT_CURRENT'])
aggregate = {
'NAME_CONTRACT_STATUS_Active': 'sum', 'NAME_CONTRACT_STATUS_Completed': 'sum', 'NAME_CONTRACT_STATUS_Demand': 'sum',
'NAME_CONTRACT_STATUS_Signed': 'sum', 'NAME_CONTRACT_STATUS_Sent proposal': 'sum',
'NAME_CONTRACT_STATUS_Refused': 'sum', 'NAME_CONTRACT_STATUS_Approved': 'sum'}
cc_balance_fe = df_cc.groupby('SK_ID_CURR').agg(aggregate)
cc_balance_train = pd.merge(df_train[['SK_ID_CURR', 'TARGET']], cc_balance_fe,how='left',on=['SK_ID_CURR'])
df_train = df_train.join(cc_balance_fe, how = 'left', on =['SK_ID_CURR'])
df_test = df_test.join(cc_balance_fe, how = 'left', on =['SK_ID_CURR'])
app_sample = cc_balance_train[cc_balance_train['SK_ID_CURR'].isin(ids)==True]
app_sample.shape
(9999, 9)
del cc_balance_train, cc_balance_fe, df_cc
X = app_sample
y = app_sample.TARGET
X = app_sample.drop(["TARGET"], axis=1)
if X.select_dtypes(include=[np.number]).any(axis=None):
num_data = X.select_dtypes(include=np.number)
num_data = num_data.fillna(num_data.mean())
else: num_data = X['SK_ID_CURR']
if X.select_dtypes(include=['object']).any(axis=None):
cat_list = X.select_dtypes(include=['object']).columns.tolist().append('SK_ID_CURR')
cat_data = X[cat_list].remove('SK_ID_CURR')
cat_data = cat_data.fillna(cat_data.mode().iloc[0])
cat_data = pd.get_dummies(cat_data, columns=cat_list)
else: cat_data = X['SK_ID_CURR']
X = pd.merge(num_data,cat_data,how='left',on=['SK_ID_CURR'])
X = X.drop(["SK_ID_CURR"], axis=1)
from sklearn.datasets import make_classification
X1, _, y1,_ = train_test_split(X,y, train_size = 0.1, stratify= y, random_state=1)
model = RandomForestClassifier()
model.fit(X1, y1)
importance = model.feature_importances_
importance_dic = {"Dataset":"cc_train","Feature":X.columns,"Importance":importance}
cc_train_importance_df = pd.DataFrame(data=importance_dic)
print("FEATURE IMPORTANCE:\n",cc_train_importance_df.sort_values(by=['Importance'], ascending=False))
cc_train_importance_df.sort_values(by=['Importance'], ascending=False).head(25).plot(x="Feature", y="Importance", kind="bar", figsize=(15,10))
FEATURE IMPORTANCE:
Dataset Feature Importance
0 cc_train NAME_CONTRACT_STATUS_Active 0.802020
1 cc_train NAME_CONTRACT_STATUS_Completed 0.147633
3 cc_train NAME_CONTRACT_STATUS_Signed 0.038312
4 cc_train NAME_CONTRACT_STATUS_Sent proposal 0.007285
5 cc_train NAME_CONTRACT_STATUS_Refused 0.004751
2 cc_train NAME_CONTRACT_STATUS_Demand 0.000000
6 cc_train NAME_CONTRACT_STATUS_Approved 0.000000
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a6cd4a60>
df_bureau = pd.read_csv('bureau.csv')
df_bureau_bal = pd.read_csv('bureau_balance.csv')
df_bureau_months = df_bureau_bal[['SK_ID_BUREAU', 'MONTHS_BALANCE']].groupby('SK_ID_BUREAU').agg('min')
df_last_DPD = df_bureau_bal[df_bureau_bal.STATUS.isin(['1','2','3','4','5'])].groupby(['SK_ID_BUREAU'])['MONTHS_BALANCE'].max()
df_last_DPD = df_last_DPD.rename('MONTH_LAST_DPD')
df_temp_stat = pd.get_dummies(df_bureau_bal['STATUS'], prefix='STATUS')
df_bureau_bal['Late_DPD'] = df_temp_stat['STATUS_1'] + df_temp_stat['STATUS_2'] + df_temp_stat['STATUS_3'] + df_temp_stat['STATUS_4'] + df_temp_stat['STATUS_5']
df_bureau_bal['closed_status'] = df_temp_stat['STATUS_C']
df_bureau_bal['unknow_status'] = df_temp_stat['STATUS_X']
df_status = df_bureau_bal[['SK_ID_BUREAU', 'Late_DPD', 'closed_status', 'unknow_status']].groupby(['SK_ID_BUREAU']).sum()
df_bureau_recent_c = df_bureau_bal[df_bureau_bal.STATUS=='C'].groupby(['SK_ID_BUREAU'])['MONTHS_BALANCE'].max()
df_bureau_recent_c.rename('MONTH_LAST_C',inplace=True).describe()
count 449604.000000 mean -6.146436 std 15.584140 min -96.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 0.000000 Name: MONTH_LAST_C, dtype: float64
df_bureau_stat = df_status.join(df_bureau_months, how = 'left').join(df_bureau_recent_c, how = 'left'). join(df_last_DPD, how = 'left')
df_bureau_stat.head()
| Late_DPD | closed_status | unknow_status | MONTHS_BALANCE | MONTH_LAST_C | MONTH_LAST_DPD | |
|---|---|---|---|---|---|---|
| SK_ID_BUREAU | ||||||
| 5001709 | 0 | 86 | 11 | -96 | 0.0 | NaN |
| 5001710 | 0 | 48 | 30 | -82 | 0.0 | NaN |
| 5001711 | 0 | 0 | 1 | -3 | NaN | NaN |
| 5001712 | 0 | 9 | 0 | -18 | 0.0 | NaN |
| 5001713 | 0 | 0 | 22 | -21 | NaN | NaN |
df_bureau_stat = df_bureau_stat.reset_index()
df_bureau_stat.head()
| SK_ID_BUREAU | Late_DPD | closed_status | unknow_status | MONTHS_BALANCE | MONTH_LAST_C | MONTH_LAST_DPD | |
|---|---|---|---|---|---|---|---|
| 0 | 5001709 | 0 | 86 | 11 | -96 | 0.0 | NaN |
| 1 | 5001710 | 0 | 48 | 30 | -82 | 0.0 | NaN |
| 2 | 5001711 | 0 | 0 | 1 | -3 | NaN | NaN |
| 3 | 5001712 | 0 | 9 | 0 | -18 | 0.0 | NaN |
| 4 | 5001713 | 0 | 0 | 22 | -21 | NaN | NaN |
def rename_credit_type(entry):
if 'loan' in entry.CREDIT_TYPE.lower(): return 'loan'
elif 'mortgage' in entry.CREDIT_TYPE.lower(): return 'loan'
elif 'credit' in entry.CREDIT_TYPE.lower(): return 'credit'
else: return entry.CREDIT_TYPE.lower()
df_bureau['credit_type'] = df_bureau.apply(rename_credit_type, axis =1)
df_bureau_active_credit = df_bureau[['SK_ID_CURR', 'CREDIT_ACTIVE']].groupby(['SK_ID_CURR', 'CREDIT_ACTIVE']).size().unstack(fill_value=0)
df_bureau_cur = df_bureau[['SK_ID_CURR', 'CREDIT_CURRENCY']].groupby(['SK_ID_CURR', 'CREDIT_CURRENCY']).size().unstack(fill_value=0)
df_bureau_type_credit = df_bureau[['SK_ID_CURR', 'credit_type']].groupby(['SK_ID_CURR', 'credit_type']).size().unstack(fill_value=0)
df_bureau_cat = df_bureau_active_credit.join(df_bureau_cur).join(df_bureau_type_credit)
df_bureau_cat = df_bureau_cat.fillna(0)
df_bureau['DAYS_CREDIT_ENDDATE'].loc[df_bureau['DAYS_CREDIT_ENDDATE'] < -40000] = np.nan
df_bureau['DAYS_CREDIT_UPDATE'].loc[df_bureau['DAYS_CREDIT_UPDATE'] < -40000] = np.nan
df_bureau['DAYS_ENDDATE_FACT'].loc[df_bureau['DAYS_ENDDATE_FACT'] < -40000] = np.nan
df_bureau['AMT_DEBT_RATIO'] = df_bureau['AMT_CREDIT_SUM_DEBT']/(df_bureau['AMT_CREDIT_SUM'])
df_bureau['AMT_LIMIT_RATIO'] = df_bureau['AMT_CREDIT_SUM_LIMIT']/(df_bureau['AMT_CREDIT_SUM'])
df_bureau['AMT_SUM_OVERDUE_RATIO'] = df_bureau['AMT_CREDIT_SUM_OVERDUE']/(df_bureau['AMT_CREDIT_SUM'])
df_bureau['AMT_MAX_OVERDUE_RATIO'] = df_bureau['AMT_CREDIT_MAX_OVERDUE']/(df_bureau['AMT_CREDIT_SUM'])
df_bureau['DAYS_END_DIFF'] = df_bureau['DAYS_ENDDATE_FACT'] - df_bureau['DAYS_CREDIT_ENDDATE']
idx = df_bureau.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].idxmax()
df_bureau_recent = df_bureau.loc[idx.values]
df_bureau = df_bureau.join(df_bureau_stat, on = 'SK_ID_BUREAU', how = 'left', rsuffix = '_balance')
aggregation = { 'Late_DPD':'sum', 'closed_status':'sum', 'unknow_status':'sum', 'MONTHS_BALANCE':['min', 'max'], 'MONTH_LAST_C':['min', 'max'], 'MONTH_LAST_DPD' : ['min', 'max']}
df_bal = df_bureau[['SK_ID_CURR','Late_DPD', 'closed_status', 'unknow_status', 'MONTHS_BALANCE', 'MONTH_LAST_C', 'MONTH_LAST_DPD']].groupby('SK_ID_CURR').agg(aggregation)
df_bureau_all = df_bureau_recent.join(df_bal, on = 'SK_ID_CURR', how = 'left').join(df_bureau_cat, on = 'SK_ID_CURR', how = 'left')
df_bureau_all.columns
Index([ 'SK_ID_CURR', 'SK_ID_BUREAU',
'CREDIT_ACTIVE', 'CREDIT_CURRENCY',
'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE',
'DAYS_CREDIT_ENDDATE', 'DAYS_ENDDATE_FACT',
'AMT_CREDIT_MAX_OVERDUE', 'CNT_CREDIT_PROLONG',
'AMT_CREDIT_SUM', 'AMT_CREDIT_SUM_DEBT',
'AMT_CREDIT_SUM_LIMIT', 'AMT_CREDIT_SUM_OVERDUE',
'CREDIT_TYPE', 'DAYS_CREDIT_UPDATE',
'AMT_ANNUITY', 'credit_type',
'AMT_DEBT_RATIO', 'AMT_LIMIT_RATIO',
'AMT_SUM_OVERDUE_RATIO', 'AMT_MAX_OVERDUE_RATIO',
'DAYS_END_DIFF', ('Late_DPD', 'sum'),
('closed_status', 'sum'), ('unknow_status', 'sum'),
('MONTHS_BALANCE', 'min'), ('MONTHS_BALANCE', 'max'),
('MONTH_LAST_C', 'min'), ('MONTH_LAST_C', 'max'),
('MONTH_LAST_DPD', 'min'), ('MONTH_LAST_DPD', 'max'),
'Active', 'Bad debt',
'Closed', 'Sold',
'currency 1', 'currency 2',
'currency 3', 'currency 4',
'credit', 'loan'],
dtype='object')
cols = ['SK_ID_CURR', 'DAYS_CREDIT', 'CREDIT_DAY_OVERDUE', 'AMT_CREDIT_MAX_OVERDUE','CNT_CREDIT_PROLONG', 'DAYS_CREDIT_UPDATE',
'AMT_ANNUITY', 'AMT_DEBT_RATIO', 'AMT_LIMIT_RATIO', 'AMT_SUM_OVERDUE_RATIO', 'AMT_MAX_OVERDUE_RATIO', 'DAYS_END_DIFF', ('Late_DPD', 'sum'),
('closed_status', 'sum'), ('unknow_status', 'sum'), ('MONTHS_BALANCE', 'min'), ('MONTHS_BALANCE', 'max'),
('MONTH_LAST_C', 'min'), ('MONTH_LAST_C', 'max'), ('MONTH_LAST_DPD', 'min'), ('MONTH_LAST_DPD', 'max'), 'Active',
'Bad debt', 'Closed', 'Sold', 'currency 1', 'currency 2', 'currency 3', 'currency 4', 'credit', 'loan']
df_bureau_all = df_bureau_all[col]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3360 try: -> 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: /usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() /usr/local/lib/python3.8/dist-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'AMT_REQ_CREDIT_BUREAU_YEAR' The above exception was the direct cause of the following exception: KeyError Traceback (most recent call last) <ipython-input-96-9adc7fee1236> in <module> 4 ('MONTH_LAST_C', 'min'), ('MONTH_LAST_C', 'max'), ('MONTH_LAST_DPD', 'min'), ('MONTH_LAST_DPD', 'max'), 'Active', 5 'Bad debt', 'Closed', 'Sold', 'currency 1', 'currency 2', 'currency 3', 'currency 4', 'credit', 'loan'] ----> 6 df_bureau_all = df_bureau_all[col] /usr/local/lib/python3.8/dist-packages/pandas/core/frame.py in __getitem__(self, key) 3456 if self.columns.nlevels > 1: 3457 return self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key) 3459 if is_integer(indexer): 3460 indexer = [indexer] /usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance) 3361 return self._engine.get_loc(casted_key) 3362 except KeyError as err: -> 3363 raise KeyError(key) from err 3364 3365 if is_scalar(key) and isna(key) and not self.hasnans: KeyError: 'AMT_REQ_CREDIT_BUREAU_YEAR'
df_train = df_train.join(df_bureau_all, on = 'SK_ID_CURR', how = 'left', rsuffix = '_bureau')
df_test = df_test.join(df_bureau_all, on = 'SK_ID_CURR', how = 'left', rsuffix = '_bureau')
df_train_b = df_train[['SK_ID_CURR', 'TARGET']].join(df_bureau_all.set_index('SK_ID_CURR'), on = 'SK_ID_CURR', how = 'left', rsuffix='_bureau')
app_sample = df_train_b[df_train_b['SK_ID_CURR'].isin(ids)==True]
app_sample.shape
(9999, 43)
y = app_sample.TARGET
X = app_sample.drop(["TARGET"], axis=1)
if X.select_dtypes(include=[np.number]).any(axis=None):
num_data = X.select_dtypes(include=np.number)
num_data = num_data.fillna(num_data.mean())
else: num_data = X['SK_ID_CURR']
if X.select_dtypes(include=['object']).any(axis=None):
cat_list = X.select_dtypes(include=['object']).columns.tolist()
cat_list.append('SK_ID_CURR')
cat_data = X[cat_list]
cat_list.remove('SK_ID_CURR')
cat_data = cat_data.fillna(cat_data.mode().iloc[0])
cat_data = pd.get_dummies(cat_data, columns=cat_list)
else: cat_data = X['SK_ID_CURR']
X = pd.merge(num_data,cat_data,how='left',on=['SK_ID_CURR'])
X = X.drop(["SK_ID_CURR"], axis=1)
X = X.fillna(0)
X.replace([np.inf, -np.inf], 0, inplace=True)
X1, _, y1,_ = train_test_split(X,y, train_size = 0.15, stratify= y, random_state=1)
model = RandomForestClassifier()
model.fit(X1, y1)
importance = model.feature_importances_
importance_dic = {"Dataset":"bureau_merged_train","Feature":X.columns,"Importance":importance}
df_bureau_train_imp = pd.DataFrame(data=importance_dic)
print("FEATURE IMPORTANCE:\n",df_bureau_train_imp.sort_values(by=['Importance'], ascending=False))
df_bureau_train_imp.sort_values(by=['Importance'], ascending=False).head(25).plot(x="Feature", y="Importance", kind="bar", figsize=(15,10))
FEATURE IMPORTANCE:
Dataset Feature \
1 bureau_merged_train DAYS_CREDIT
3 bureau_merged_train DAYS_CREDIT_ENDDATE
11 bureau_merged_train DAYS_CREDIT_UPDATE
7 bureau_merged_train AMT_CREDIT_SUM
0 bureau_merged_train SK_ID_BUREAU
13 bureau_merged_train AMT_DEBT_RATIO
8 bureau_merged_train AMT_CREDIT_SUM_DEBT
31 bureau_merged_train currency 1
29 bureau_merged_train Closed
27 bureau_merged_train Active
35 bureau_merged_train credit
12 bureau_merged_train AMT_ANNUITY
5 bureau_merged_train AMT_CREDIT_MAX_OVERDUE
9 bureau_merged_train AMT_CREDIT_SUM_LIMIT
4 bureau_merged_train DAYS_ENDDATE_FACT
17 bureau_merged_train DAYS_END_DIFF
14 bureau_merged_train AMT_LIMIT_RATIO
36 bureau_merged_train loan
16 bureau_merged_train AMT_MAX_OVERDUE_RATIO
46 bureau_merged_train CREDIT_TYPE_Consumer credit
47 bureau_merged_train CREDIT_TYPE_Credit card
40 bureau_merged_train CREDIT_ACTIVE_Sold
30 bureau_merged_train Sold
37 bureau_merged_train CREDIT_ACTIVE_Active
50 bureau_merged_train CREDIT_TYPE_Microloan
39 bureau_merged_train CREDIT_ACTIVE_Closed
54 bureau_merged_train credit_type_loan
53 bureau_merged_train credit_type_credit
51 bureau_merged_train CREDIT_TYPE_Mortgage
45 bureau_merged_train CREDIT_TYPE_Car loan
28 bureau_merged_train Bad debt
10 bureau_merged_train AMT_CREDIT_SUM_OVERDUE
2 bureau_merged_train CREDIT_DAY_OVERDUE
15 bureau_merged_train AMT_SUM_OVERDUE_RATIO
33 bureau_merged_train currency 3
6 bureau_merged_train CNT_CREDIT_PROLONG
32 bureau_merged_train currency 2
34 bureau_merged_train currency 4
24 bureau_merged_train (MONTH_LAST_C, max)
38 bureau_merged_train CREDIT_ACTIVE_Bad debt
52 bureau_merged_train CREDIT_TYPE_Unknown type of loan
26 bureau_merged_train (MONTH_LAST_DPD, max)
18 bureau_merged_train (Late_DPD, sum)
49 bureau_merged_train CREDIT_TYPE_Loan for working capital replenish...
48 bureau_merged_train CREDIT_TYPE_Loan for business development
19 bureau_merged_train (closed_status, sum)
20 bureau_merged_train (unknow_status, sum)
21 bureau_merged_train (MONTHS_BALANCE, min)
44 bureau_merged_train CREDIT_TYPE_Another type of loan
25 bureau_merged_train (MONTH_LAST_DPD, min)
42 bureau_merged_train CREDIT_CURRENCY_currency 2
41 bureau_merged_train CREDIT_CURRENCY_currency 1
22 bureau_merged_train (MONTHS_BALANCE, max)
23 bureau_merged_train (MONTH_LAST_C, min)
43 bureau_merged_train CREDIT_CURRENCY_currency 3
Importance
1 0.101760
3 0.094476
11 0.090071
7 0.088895
0 0.082249
13 0.065622
8 0.063717
31 0.048769
29 0.047333
27 0.045883
35 0.043447
12 0.043290
5 0.029690
9 0.025955
4 0.024451
17 0.018485
14 0.014556
36 0.013404
16 0.009919
46 0.009724
47 0.007823
40 0.005620
30 0.004630
37 0.004606
50 0.003630
39 0.003604
54 0.002308
53 0.002261
51 0.000788
45 0.000669
28 0.000436
10 0.000402
2 0.000377
15 0.000300
33 0.000289
6 0.000193
32 0.000189
34 0.000181
24 0.000000
38 0.000000
52 0.000000
26 0.000000
18 0.000000
49 0.000000
48 0.000000
19 0.000000
20 0.000000
21 0.000000
44 0.000000
25 0.000000
42 0.000000
41 0.000000
22 0.000000
23 0.000000
43 0.000000
<matplotlib.axes._subplots.AxesSubplot at 0x7fe2a69067c0>
df_ip = pd.read_csv('installments_payments.csv')
aggregation = {'AMT_PAYMENT':['min', 'max', 'mean'],
'AMT_INSTALMENT' : ['min', 'max', 'mean']}
df_install = df_ip.groupby('SK_ID_CURR').agg(aggregation)
df_ip_tr = df_train[['SK_ID_CURR','TARGET']].join(df_install, how = 'left')
df_train = df_train.join(df_install, on = 'SK_ID_CURR', how = 'left')
df_test = df_test.join(df_install, on = 'SK_ID_CURR', how = 'left')
del df_install, df_ip
app_sample = df_ip_tr[df_ip_tr['SK_ID_CURR'].isin(ids)]
app_sample.shape
(9999, 8)
y = app_sample.TARGET
X = app_sample.drop(["TARGET"], axis=1)
if X.select_dtypes(include=[np.number]).any(axis=None):
num_data = X.select_dtypes(include=np.number)
num_data = num_data.fillna(num_data.mean())
else: num_data = X['SK_ID_CURR']
if X.select_dtypes(include=['object']).any(axis=None):
cat_list = X.select_dtypes(include=['object']).columns.tolist()
cat_list.append('SK_ID_CURR')
cat_data = X[cat_list]
cat_list.remove('SK_ID_CURR')
cat_data = cat_data.fillna(cat_data.mode().iloc[0])
cat_data = pd.get_dummies(cat_data, columns=cat_list)
else: cat_data = X['SK_ID_CURR']
X = pd.merge(num_data,cat_data,how='left',on=['SK_ID_CURR'])
X = X.drop(["SK_ID_CURR"], axis=1)
X1, _, y1,_ = train_test_split(X,y, train_size = 0.15, stratify= y, random_state=1)
model = RandomForestClassifier()
model.fit(X1, y1)
importance = model.feature_importances_
importance_dic = {"Dataset":"installments_train","Feature":X.columns,"Importance":importance}
installments_train_importance_df = pd.DataFrame(data=importance_dic)
print("FEATURE IMPORTANCE:\n",installments_train_importance_df.sort_values(by=['Importance'], ascending=False))
installments_train_importance_df.sort_values(by=['Importance'], ascending=False).head(25).plot(x="Feature", y="Importance", kind="bar", figsize=(15,10))
FEATURE IMPORTANCE:
Dataset Feature Importance
5 installments_train (AMT_INSTALMENT, mean) 0.179301
2 installments_train (AMT_PAYMENT, mean) 0.172085
4 installments_train (AMT_INSTALMENT, max) 0.169536
1 installments_train (AMT_PAYMENT, max) 0.161424
0 installments_train (AMT_PAYMENT, min) 0.160936
3 installments_train (AMT_INSTALMENT, min) 0.156718
<matplotlib.axes._subplots.AxesSubplot at 0x7fe27b78aa90>
df_cb = pd.read_csv('POS_CASH_balance.csv')
df_cb = pd.concat([df_cb, pd.get_dummies(df_cb['NAME_CONTRACT_STATUS'], prefix='NAME_CONTRACT_STATUS')], axis=1)
df_cb = df_cb.drop(['NAME_CONTRACT_STATUS'], axis = 1)
aggregate = {
'SK_ID_PREV': ['count'], 'CNT_INSTALMENT': ['sum', 'max', 'mean'],
'CNT_INSTALMENT_FUTURE': ['sum', 'max', 'mean'], 'NAME_CONTRACT_STATUS_Approved': 'sum',
'NAME_CONTRACT_STATUS_Canceled': 'sum', 'NAME_CONTRACT_STATUS_Completed': 'sum',
'NAME_CONTRACT_STATUS_Demand': 'sum', 'NAME_CONTRACT_STATUS_Returned to the store': 'sum',
'NAME_CONTRACT_STATUS_Signed': 'sum', 'NAME_CONTRACT_STATUS_XNA': 'sum',
'SK_DPD': ['sum', 'mean'], 'SK_DPD_DEF': ['sum', 'mean']}
df_cb_fe = df_cb.groupby('SK_ID_CURR').agg(aggregate)
df_cb_tr= df_train[['SK_ID_CURR','TARGET']].join(df_cb_fe, how = 'left')
df_train = df_train.join(df_cb_fe, on = 'SK_ID_CURR', how = 'left')
df_test = df_test.join(df_cb_fe, on = 'SK_ID_CURR', how = 'left')
X = df_cb_tr[df_cb_tr['SK_ID_CURR'].isin(ids)]
y = app_sample.TARGET
X = app_sample.drop(["TARGET"], axis=1)
if X.select_dtypes(include=[np.number]).any(axis=None):
num_data = X.select_dtypes(include=np.number)
num_data = num_data.fillna(num_data.mean())
else: num_data = X['SK_ID_CURR']
if X.select_dtypes(include=['object']).any(axis=None):
cat_list = X.select_dtypes(include=['object']).columns.tolist()
cat_list.append('SK_ID_CURR')
cat_data = X[cat_list]
cat_list.remove('SK_ID_CURR')
cat_data = cat_data.fillna(cat_data.mode().iloc[0])
cat_data = pd.get_dummies(cat_data, columns=cat_list)
else: cat_data = X['SK_ID_CURR']
X = pd.merge(num_data,cat_data,how='left',on=['SK_ID_CURR'])
X = X.drop(["SK_ID_CURR"], axis=1)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot
X1, _, y1,_ = train_test_split(X,y, train_size = 0.15, stratify= y, random_state=1)
model = RandomForestClassifier()
model.fit(X1, y1)
importance = model.feature_importances_
df_imp_dc = {"Dataset":"pos_train","Feature":X.columns,"Importance":importance}
df_cb_tr_imp = pd.DataFrame(data=importance_dic)
print("FEATURE IMPORTANCE:\n",df_cb_tr_imp.sort_values(by=['Importance'], ascending=False))
df_cb_tr_imp.sort_values(by=['Importance'], ascending=False).head(25).plot(x="Feature", y="Importance", kind="bar", figsize=(15,10))
FEATURE IMPORTANCE:
Dataset Feature Importance
5 installments_train (AMT_INSTALMENT, mean) 0.179301
2 installments_train (AMT_PAYMENT, mean) 0.172085
4 installments_train (AMT_INSTALMENT, max) 0.169536
1 installments_train (AMT_PAYMENT, max) 0.161424
0 installments_train (AMT_PAYMENT, min) 0.160936
3 installments_train (AMT_INSTALMENT, min) 0.156718
<matplotlib.axes._subplots.AxesSubplot at 0x7fe27ea14610>
df_imp_attributes = pd.concat([application_train_importance_df, cc_train_importance_df, df_bureau_train_imp,
installments_train_importance_df, df_cb_tr_imp])
df_imp_attributes = df_imp_attributes[df_imp_attributes['Feature'].str.contains("SK_ID_PREV")==False]
best_feat = df_imp_attributes.sort_values(by=['Importance'], ascending=False).head(50)
print("TOP 50 Features:\n",best_feat)
best_feat.sort_values(by=['Importance'], ascending=False).head(50).plot(x="Feature", y="Importance", kind="bar", figsize=(15,10))
TOP 50 Features:
Dataset Feature Importance
0 cc_train NAME_CONTRACT_STATUS_Active 0.802020
1 cc_train NAME_CONTRACT_STATUS_Completed 0.147633
1 bureau_merged_train DAYS_CREDIT 0.101760
3 bureau_merged_train DAYS_CREDIT_ENDDATE 0.094476
11 bureau_merged_train DAYS_CREDIT_UPDATE 0.090071
7 bureau_merged_train AMT_CREDIT_SUM 0.088895
0 bureau_merged_train SK_ID_BUREAU 0.082249
13 bureau_merged_train AMT_DEBT_RATIO 0.065622
8 bureau_merged_train AMT_CREDIT_SUM_DEBT 0.063717
31 bureau_merged_train currency 1 0.048769
29 bureau_merged_train Closed 0.047333
27 bureau_merged_train Active 0.045883
35 bureau_merged_train credit 0.043447
12 bureau_merged_train AMT_ANNUITY 0.043290
78 application_train LOAN_TO_VALUE 0.040363
23 application_train EXT_SOURCE_2 0.040164
3 cc_train NAME_CONTRACT_STATUS_Signed 0.038312
24 application_train EXT_SOURCE_3 0.037977
8 application_train DAYS_REGISTRATION 0.035499
6 application_train DAYS_BIRTH 0.031352
5 bureau_merged_train AMT_CREDIT_MAX_OVERDUE 0.029690
9 application_train DAYS_ID_PUBLISH 0.029223
9 bureau_merged_train AMT_CREDIT_SUM_LIMIT 0.025955
76 application_train INC_AGE 0.025708
79 application_train LOAN_FRAC 0.025444
7 application_train DAYS_EMPLOYED 0.024996
70 application_train DAYS_LAST_PHONE_CHANGE 0.024557
4 bureau_merged_train DAYS_ENDDATE_FACT 0.024451
19 application_train HOUR_APPR_PROCESS_START 0.022943
3 application_train AMT_ANNUITY 0.022894
1 application_train AMT_INCOME_TOTAL 0.022546
5 application_train REGION_POPULATION_RELATIVE 0.019093
17 bureau_merged_train DAYS_END_DIFF 0.018485
2 application_train AMT_CREDIT 0.017902
4 application_train AMT_GOODS_PRICE 0.016735
22 application_train EXT_SOURCE_1 0.014900
14 bureau_merged_train AMT_LIMIT_RATIO 0.014556
36 bureau_merged_train loan 0.013404
75 application_train AMT_REQ_CREDIT_BUREAU_YEAR 0.012696
68 application_train OBS_30_CNT_SOCIAL_CIRCLE 0.012337
10 application_train OWN_CAR_AGE 0.011341
16 bureau_merged_train AMT_MAX_OVERDUE_RATIO 0.009919
16 application_train CNT_FAM_MEMBERS 0.009890
180 application_train ORGANIZATION_TYPE_Restaurant 0.009889
46 bureau_merged_train CREDIT_TYPE_Consumer credit 0.009724
69 application_train OBS_60_CNT_SOCIAL_CIRCLE 0.009564
74 application_train AMT_REQ_CREDIT_BUREAU_QRT 0.009015
73 application_train AMT_REQ_CREDIT_BUREAU_MON 0.008541
39 application_train APARTMENTS_MODE 0.008421
64 application_train LIVINGAREA_MEDI 0.008272
<matplotlib.axes._subplots.AxesSubplot at 0x7fe27ba54580>
The above graph is used to visualize the importance of top 50 features from the data. These top 50 features were further used to train our models.
best_feat = best_feat.reset_index()
features = list(best_feat.iloc[:50].Feature)
features.append('SK_ID_CURR')
features.append('TARGET')
features.remove('ORGANIZATION_TYPE_Restaurant')
features.remove('CREDIT_TYPE_Consumer credit')
#features.remove('WEEKDAY_APPR_PROCESS_START_MONDAY')
merged_train_subset = df_train[features].copy()
features.remove('TARGET')
merged_test_subset = df_test[features].copy()
#Save checkpoint
merged_train_subset.to_csv('merged_train_subset.csv')
merged_test_subset.to_csv('merged_test_subset.csv')
# Uncomment while loading from above checkpoint
merged_train_subset = pd.read_csv('merged_train_subset.csv')
merged_test_subset = pd.read_csv('merged_test_subset.csv')
merged_train_subset = merged_train_subset.drop('Unnamed: 0', axis = 1)
merged_test_subset = merged_test_subset.drop('Unnamed: 0', axis = 1)
merged_train_subset = merged_train_subset.replace([float('inf')],np.nan)
merged_train_subset = merged_train_subset.replace([float('-inf')],np.nan)
merged_test_subset = merged_test_subset.replace([float('inf')],np.nan)
merged_test_subset = merged_test_subset.replace([float('-inf')],np.nan)
loan_def = merged_train_subset[merged_train_subset['TARGET']==1]
loan_pai = merged_train_subset[merged_train_subset['TARGET']==0]
loan_def_len = merged_train_subset[merged_train_subset['TARGET']==1].shape[0]
loan_pai_len = merged_train_subset[merged_train_subset['TARGET']==0].shape[0]
n = merged_train_subset.shape[0]
loan_def_len_fr = loan_def_len/n
loan_pai_len_fr = loan_pai_len/n
print("{}% of applicatnts are given a loan".format(np.round(loan_def_len_fr*100,0)))
print("{}% of applicatnts are not given a loan".format(np.round(loan_pai_len_fr*100,0)))
8.0% of applicatnts are given a loan 92.0% of applicatnts are not given a loan
#Creating a subset containing 50,000 rows of training data
loan_app_sample = loan_def.sample(n=int(50000*loan_def_len_fr))
loan_dis_sample = loan_pai.sample(n=int(50000*loan_pai_len_fr))
app_sample = pd.concat([loan_app_sample, loan_dis_sample])
app_sample.head()
app_sample.to_csv('app_sample.csv')
X = app_sample.drop(["TARGET","SK_ID_CURR"], axis=1)
y = app_sample.TARGET
cat_features = X.select_dtypes(include=['object']).columns.tolist()
num_features = X.select_dtypes(include=np.number).columns.tolist()
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
cat_pipe = Pipeline([
('selector', DataFrameSelector(cat_features)),
('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))])
num_pipe = Pipeline([
('selector', DataFrameSelector(num_features)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler())])
data_prep_pipe = ColumnTransformer(transformers= [
("num_pipeline", num_pipe, num_features),
("cat_pipeline", cat_pipe, cat_features)],
remainder='drop',
n_jobs=-1
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
lr_pipe = Pipeline([("data_prep", data_prep_pipe), ("lr", LogisticRegression())])
params = {'lr__C':[1.0, 10.0, 100.0, 1000.0], 'lr__penalty':['none', 'l1','l2']}
lr_gridsearch = GridSearchCV(lr_pipe, param_grid = params, cv = 3, scoring='accuracy')
from time import time
print("Performing grid search...")
print("pipeline:", [name for name, _ in lr_pipe.steps])
print("parameters:")
print(params)
tnot = time()
lr_gridsearch.fit(X_train, y_train)
print("Time taken: %0.3fs" % (time() - tnot))
print("Best parameters set found on development set:")
print(lr_gridsearch.best_params_)
print("Grid scores on development set:")
means = lr_gridsearch.cv_results_['mean_test_score']
stds = lr_gridsearch.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, lr_gridsearch.cv_results_['params']):
print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
scoring='accuracy'
print("\nBest %s score: %0.3f" %(scoring, lr_gridsearch.best_score_))
print("\nBest parameters set:")
best_parameters = lr_gridsearch.best_estimator_.get_params()
for param_name in sorted(params.keys()): print("\t%s: %r" % (param_name, best_parameters[param_name]))
sortedGridSearchResults = sorted(zip(lr_gridsearch.cv_results_["params"], lr_gridsearch.cv_results_["mean_test_score"]), key=lambda x: x[1], reverse=True)
print(f'Top 2 GridSearch results: ({scoring}, hyperparam Combo)\n {sortedGridSearchResults[0]}\n {sortedGridSearchResults[1]}\n\n\n')
y_pred_train = lr_gridsearch.predict(X_train)
print(f"Training accuracy by best pipeline is: {accuracy_score(y_pred_train, y_train):.3f}")
y_pred_test = lr_gridsearch.predict(X_test)
print(f"Testing accuracy by best pipeline is: {accuracy_score(y_pred_test, y_test):.3f}")
Performing grid search...
pipeline: ['data_prep', 'lr']
parameters:
{'lr__C': [1.0, 10.0, 100.0, 1000.0], 'lr__penalty': ['none', 'l1', 'l2']}
Time taken: 30.751s
Best parameters set found on development set:
{'lr__C': 10.0, 'lr__penalty': 'l2'}
Grid scores on development set:
0.920 (+/-0.000) for {'lr__C': 1.0, 'lr__penalty': 'none'}
nan (+/-nan) for {'lr__C': 1.0, 'lr__penalty': 'l1'}
0.920 (+/-0.000) for {'lr__C': 1.0, 'lr__penalty': 'l2'}
0.920 (+/-0.000) for {'lr__C': 10.0, 'lr__penalty': 'none'}
nan (+/-nan) for {'lr__C': 10.0, 'lr__penalty': 'l1'}
0.920 (+/-0.000) for {'lr__C': 10.0, 'lr__penalty': 'l2'}
0.920 (+/-0.000) for {'lr__C': 100.0, 'lr__penalty': 'none'}
nan (+/-nan) for {'lr__C': 100.0, 'lr__penalty': 'l1'}
0.920 (+/-0.000) for {'lr__C': 100.0, 'lr__penalty': 'l2'}
0.920 (+/-0.000) for {'lr__C': 1000.0, 'lr__penalty': 'none'}
nan (+/-nan) for {'lr__C': 1000.0, 'lr__penalty': 'l1'}
0.920 (+/-0.000) for {'lr__C': 1000.0, 'lr__penalty': 'l2'}
Best accuracy score: 0.920
Best parameters set:
lr__C: 10.0
lr__penalty: 'l2'
Top 2 GridSearch results: (accuracy, hyperparam Combo)
({'lr__C': 1.0, 'lr__penalty': 'none'}, 0.9196729918247956)
({'lr__C': 1.0, 'lr__penalty': 'l1'}, nan)
Training accuracy by best pipeline is: 0.920
Testing accuracy by best pipeline is: 0.918
y_valid_pred = lr_gridsearch.best_estimator_.predict(X_test)
print("Confusion matrix (validation data)")
print(confusion_matrix(y_valid_pred, y_test))
print("------------------")
print(f"Overall accuracy (validation data): {np.round(accuracy_score(y_test, y_valid_pred), 3)*100}%")
print("------------------")
print(f"AUROC (validation data): {np.round(roc_auc_score(y_test, y_valid_pred), 3)*100}%")
print()
RocCurveDisplay.from_estimator(lr_gridsearch.best_estimator_, X_test, y_test)
plt.show()
Confusion matrix (validation data) [[9170 819] [ 4 7]] ------------------ Overall accuracy (validation data): 91.8% ------------------ AUROC (validation data): 50.4%
expLog = pd.DataFrame(columns=["model_name",
"Train Acc",
"Valid Acc",
"Train AUC",
"Valid AUC",
"Comment (optional)"])
expLog.loc[len(expLog)] = ["Baseline model: Logistic regression"] + list(np.round(
[accuracy_score(y_train, lr_gridsearch.best_estimator_.predict(X_train)),
accuracy_score(y_test, lr_gridsearch.best_estimator_.predict(X_test)),
roc_auc_score(y_train, lr_gridsearch.best_estimator_.predict_proba(X_train)[:, 1]),
roc_auc_score(y_test, lr_gridsearch.best_estimator_.predict_proba(X_test)[:, 1])], 3)) + [
""]
expLog
| model_name | Train Acc | Valid Acc | Train AUC | Valid AUC | Comment (optional) | |
|---|---|---|---|---|---|---|
| 0 | Baseline model: Logistic regression | 0.92 | 0.918 | 0.729 | 0.733 |
rf_pipe = Pipeline([
("data_prep", data_prep_pipe),
('RandomForest', RandomForestClassifier(random_state=42))])
params = {'RandomForest__max_depth':[10,30,50],
'RandomForest__max_features':[10,15,20],
'RandomForest__min_samples_split':[5,10,15],
'RandomForest__min_samples_leaf':[10,25,50],
'RandomForest__bootstrap': [False],
'RandomForest__n_estimators':[20,80,150]}
params = {'RandomForest__max_depth':[10,50],
'RandomForest__max_features':[10,20],
'RandomForest__min_samples_split':[5,10,15],
'RandomForest__min_samples_leaf':[10,25],
'RandomForest__bootstrap': [False],
'RandomForest__n_estimators':[20,50,80]}
rf_gridsearch = GridSearchCV(rf_pipe, param_grid = params, cv = 3, scoring='accuracy', n_jobs = -1)
print("Gridsearch using Random Forest")
print("pipeline:", [name for name, _ in rf_pipe.steps])
print("parameters:")
print(params)
t0 = time()
rf_gridsearch.fit(X_train, y_train)
print("done in %0.3fs" % (time() - t0))
print("Best parameters set found on development set:")
print(rf_gridsearch.best_params_)
print("Grid scores on development set:")
means = rf_gridsearch.cv_results_['mean_test_score']
stds = rf_gridsearch.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, rf_gridsearch.cv_results_['params']): print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
scoring='accuracy'
print("Best %s score: %0.3f" %(scoring, rf_gridsearch.best_score_))
print("Best parameters set:")
best_parameters = rf_gridsearch.best_estimator_.get_params()
for param_name in sorted(params.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))
sortedGridSearchResults = sorted(zip(rf_gridsearch.cv_results_["params"], rf_gridsearch.cv_results_["mean_test_score"]), key=lambda x: x[1], reverse=True)
print(f'Top 2 GridSearch results: ({scoring}, hyperparam Combo)\n {sortedGridSearchResults[0]}\n {sortedGridSearchResults[1]}\n\n\n')
y_pred_train = rf_gridsearch.predict(X_train)
print(f"Training accuracy by best pipeline is: {accuracy_score(y_pred_train, y_train):.3f}")
y_pred_test = rf_gridsearch.predict(X_test)
print(f"Testing accuracy by best pipeline is: {accuracy_score(y_pred_test, y_test):.3f}")
Gridsearch using Random Forest
pipeline: ['data_prep', 'RandomForest']
parameters:
{'RandomForest__max_depth': [10, 50], 'RandomForest__max_features': [10, 20], 'RandomForest__min_samples_split': [5, 10, 15], 'RandomForest__min_samples_leaf': [10, 25], 'RandomForest__bootstrap': [False], 'RandomForest__n_estimators': [20, 50, 80]}
done in 1518.234s
Best parameters set found on development set:
{'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
Grid scores on development set:
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 10, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.001) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.001) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.001) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.001) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.919 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.001) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.001) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 10, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 80}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 20}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 50}
0.920 (+/-0.000) for {'RandomForest__bootstrap': False, 'RandomForest__max_depth': 50, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 15, 'RandomForest__n_estimators': 80}
Best accuracy score: 0.920
Best parameters set:
RandomForest__bootstrap: False
RandomForest__max_depth: 10
RandomForest__max_features: 20
RandomForest__min_samples_leaf: 25
RandomForest__min_samples_split: 5
RandomForest__n_estimators: 50
Top 2 GridSearch results: (accuracy, hyperparam Combo)
({'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 5, 'RandomForest__n_estimators': 50}, 0.9198229955748894)
({'RandomForest__bootstrap': False, 'RandomForest__max_depth': 10, 'RandomForest__max_features': 20, 'RandomForest__min_samples_leaf': 25, 'RandomForest__min_samples_split': 10, 'RandomForest__n_estimators': 50}, 0.9198229955748894)
Training accuracy by best pipeline is: 0.920
Testing accuracy by best pipeline is: 0.917
y_valid_pred = rf_gridsearch.best_estimator_.predict(X_test)
print("Confusion matrix (validation data)")
print(confusion_matrix(y_valid_pred, y_test))
print("------------------")
print(f"Overall accuracy (validation data): {np.round(accuracy_score(y_test, y_valid_pred), 3)*100}%")
print("------------------")
print(f"AUROC (validation data): {np.round(roc_auc_score(y_test, y_valid_pred), 3)*100}%")
RocCurveDisplay.from_estimator(rf_gridsearch.best_estimator_, X_test, y_test)
plt.show()
Confusion matrix (validation data) [[9174 825] [ 0 1]] ------------------ Overall accuracy (validation data): 91.8% ------------------ AUROC (validation data): 50.1%
expLog.loc[len(expLog)] = ["Baseline model: Random Forest Model"] + list(np.round(
[accuracy_score(y_train, rf_gridsearch.best_estimator_.predict(X_train)),
accuracy_score(y_test, rf_gridsearch.best_estimator_.predict(X_test)),
roc_auc_score(y_train, rf_gridsearch.best_estimator_.predict_proba(X_train)[:, 1]),
roc_auc_score(y_test, rf_gridsearch.best_estimator_.predict_proba(X_test)[:, 1])], 3)) + [
""]
expLog
| model_name | Train Acc | Valid Acc | Train AUC | Valid AUC | Comment (optional) | |
|---|---|---|---|---|---|---|
| 0 | Baseline model: Logistic regression | 0.92 | 0.918 | 0.729 | 0.733 | |
| 1 | Baseline model: Random Forest Model | 0.92 | 0.918 | 0.859 | 0.725 |
X_test = merged_test_subset
y_test_pred_proba = rf_gridsearch.predict_proba(X_test)[:,1].reshape(-1,1)
df_kaggle = np.concatenate((X_test.SK_ID_CURR.values.reshape(-1,1), y_test_pred_proba), axis=1)
df_kaggle = pd.DataFrame(df_kaggle, columns = ["SK_ID_CURR", "TARGET"])
df_kaggle = df_kaggle.astype({"SK_ID_CURR": int, "TARGET": float})
df_kaggle.head()
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.118989 |
| 1 | 100005 | 0.096274 |
| 2 | 100013 | 0.029102 |
| 3 | 100028 | 0.038424 |
| 4 | 100038 | 0.146930 |
We have achieved a training accuracy of 92% and AUC score of 0.73 on test data.
from sklearn.linear_model import SGDClassifier
sgd_pipe = Pipeline([
("data_prep", data_prep_pipe),
#('feature_selection', SequentialFeatureSelector(LogisticRegression(),direction='forward',n_features_to_select=15)),
('StochasticGD', SGDClassifier(random_state=42))])
params = {'StochasticGD__loss':['log'],
'StochasticGD__penalty':['l1', 'l2', 'elasticnet'],
'StochasticGD__tol':[0.0001, 0.00001, 0.0000001],
'StochasticGD__alpha':[0.1, 0.01, 0.001, 0.0001]}
sgd_gridsearch = GridSearchCV(sgd_pipe, param_grid = params, cv = 3, scoring='accuracy')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Gridsearch:")
print("Pipeline:", [name for name, _ in sgd_pipe.steps])
print("Parameters:")
print(params)
t0 = time()
sgd_gridsearch.fit(X_train, y_train)
print("Took %0.3fs to complete" % (time() - t0))
print("Best parameters set found on development set:")
print(sgd_gridsearch.best_params_)
print("Grid scores on development set:")
means = sgd_gridsearch.cv_results_['mean_test_score']
stds = sgd_gridsearch.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, sgd_gridsearch.cv_results_['params']): print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
scoring='accuracy'
print("Best %s score: %0.3f" %(scoring, sgd_gridsearch.best_score_))
print("Best parameters set:")
best_parameters = sgd_gridsearch.best_estimator_.get_params()
for param_name in sorted(params.keys()):print("\t%s: %r" % (param_name, best_parameters[param_name]))
sortedGridSearchResults = sorted(zip(sgd_gridsearch.cv_results_["params"], sgd_gridsearch.cv_results_["mean_test_score"]), key=lambda x: x[1], reverse=True)
print(f'Top 2 GridSearch results: ({scoring}, hyperparam Combo)\n {sortedGridSearchResults[0]}\n {sortedGridSearchResults[1]}\n\n\n')
y_pred_train = sgd_gridsearch.predict(X_train)
print(f"Training accuracy by best pipeline is: {accuracy_score(y_pred_train, y_train):.3f}")
y_pred_test = sgd_gridsearch.predict(X_test)
print(f"Testing accuracy by best pipeline is: {accuracy_score(y_pred_test, y_test):.3f}")
Gridsearch:
Pipeline: ['data_prep', 'StochasticGD']
Parameters:
{'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-07}
Took 85.919s to complete
Best parameters set found on development set:
{'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-07}
Grid scores on development set:
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.01, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.000) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.001) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.001) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.001) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-07}
0.920 (+/-0.001) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 0.0001}
0.920 (+/-0.001) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-05}
0.920 (+/-0.001) for {'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-07}
0.918 (+/-0.004) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 0.0001}
0.918 (+/-0.004) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-05}
0.918 (+/-0.004) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-07}
0.919 (+/-0.002) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 0.0001}
0.919 (+/-0.002) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-05}
0.919 (+/-0.002) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l2', 'StochasticGD__tol': 1e-07}
0.918 (+/-0.005) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 0.0001}
0.918 (+/-0.005) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-05}
0.918 (+/-0.005) for {'StochasticGD__alpha': 0.0001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'elasticnet', 'StochasticGD__tol': 1e-07}
Best accuracy score: 0.920
Best parameters set:
StochasticGD__alpha: 0.001
StochasticGD__loss: 'log'
StochasticGD__penalty: 'l1'
StochasticGD__tol: 1e-07
Top 2 GridSearch results: (accuracy, hyperparam Combo)
({'StochasticGD__alpha': 0.001, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 1e-07}, 0.9201405810290751)
({'StochasticGD__alpha': 0.1, 'StochasticGD__loss': 'log', 'StochasticGD__penalty': 'l1', 'StochasticGD__tol': 0.0001}, 0.9201405761309761)
Training accuracy by best pipeline is: 0.920
Testing accuracy by best pipeline is: 0.917
y_valid_pred = sgd_gridsearch.best_estimator_.predict(X_test)
print("Confusion matrix (validation data)")
print(confusion_matrix(y_valid_pred, y_test))
print("------------------")
print(f"Overall accuracy (validation data): {np.round(accuracy_score(y_test, y_valid_pred), 3)*100}%")
print("------------------")
print(f"AUROC (validation data): {np.round(roc_auc_score(y_test, y_valid_pred), 3)*100}%")
print()
RocCurveDisplay.from_estimator(sgd_gridsearch.best_estimator_, X_test, y_test)
plt.show()
Confusion matrix (validation data) [[13757 1238] [ 2 3]] ------------------ Overall accuracy (validation data): 91.7% ------------------ AUROC (validation data): 50.1%
expLog.loc[len(expLog)] = ["Baseline model: Stochastic Gradient Descent"] + list(np.round(
[accuracy_score(y_train, sgd_gridsearch.best_estimator_.predict(X_train)),
accuracy_score(y_test, sgd_gridsearch.best_estimator_.predict(X_test)),
roc_auc_score(y_train, sgd_gridsearch.best_estimator_.predict_proba(X_train)[:, 1]),
roc_auc_score(y_test, sgd_gridsearch.best_estimator_.predict_proba(X_test)[:, 1])], 3)) + [
""]
expLog
| model_name | Train Acc | Valid Acc | Train AUC | Valid AUC | Comment (optional) | |
|---|---|---|---|---|---|---|
| 0 | Baseline model: Logistic regression | 0.92 | 0.918 | 0.729 | 0.733 | |
| 1 | Baseline model: Random Forest Model | 0.92 | 0.918 | 0.859 | 0.725 | |
| 2 | Baseline model: Stochastic Gradient Descent | 0.92 | 0.917 | 0.729 | 0.724 |
We have achieved a training accuracy of 92% and AUC score of 0.72
df_kaggle.to_csv('submission2.csv', index = False)
#from IPython.display import Image
Image(filename='/content/Kaggle_Submission2.png')
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot
import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, accuracy_score, roc_auc_score, confusion_matrix, RocCurveDisplay
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.neural_network import MLPClassifier
from time import time
expLog = pd.DataFrame(columns=["model_name", "Train Acc", "Valid Acc","Train AUC", "Valid AUC","Comment (optional)"])
application_train = pd.read_csv('application_train.csv')
#Subset dataframe
#subset application train dataset to improve performance
#Calculate the percentage of applicants given a loan or not
app1 = application_train[application_train['TARGET']==1]
app0 = application_train[application_train['TARGET']==0]
app1_len = app1.shape[0]
app0_len = app0.shape[0]
n = app1_len + app0_len
app1_len_proportion = app1_len/n
app0__len_proportion = app0_len/n
#subset rows from data
app1_sample = app1.sample(n=int(10000*app1_len_proportion))
app0_sample = app0.sample(n=int(10000*app0__len_proportion))
application_train = pd.concat([app1_sample,app0_sample]) # End Subset
application_train.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45133 | 152281 | 1 | Cash loans | M | N | Y | 0 | 270000.0 | 518562.0 | 22099.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 254732 | 394763 | 1 | Revolving loans | F | N | Y | 0 | 90000.0 | 247500.0 | 12375.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 254973 | 395034 | 1 | Cash loans | F | Y | Y | 1 | 135000.0 | 808650.0 | 26217.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 276524 | 420452 | 1 | Cash loans | M | Y | Y | 0 | 135000.0 | 832500.0 | 42507.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 117343 | 236080 | 1 | Cash loans | F | N | N | 0 | 157500.0 | 1006920.0 | 51412.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 122 columns
#Train Test Split
X = application_train.loc[:, application_train.columns != 'TARGET']
y = application_train['TARGET']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#Create full set of categorical features
cat_features = X.select_dtypes(include=['object']).columns.tolist()
#Numeric Features
num_features = X.select_dtypes(include=np.number).columns.tolist()
# Check if all columns selected
if X.shape[1] == len(num_features) + len(cat_features): print("All columns have been selected")
else: print("All columns have not been selected, re-evaluate selection criteria")
All columns have been selected
# Create class for feature selection in the form of df columns
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
# Pipeline for categorical and numeric features
cat_pipe = Pipeline([
('selector', DataFrameSelector(cat_features)),
('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))]) # ignore values from validation/test data that do NOT occur in training set
# Baseline pipeline for numerical features
num_pipe = Pipeline([
('selector', DataFrameSelector(num_features)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler())])
#Used ColumnTransformer as opposed to FeatureUnion in Phase1
data_prep_pipe = ColumnTransformer(transformers= [
# (name, transformer, columns)
("num_pipeline", num_pipe, num_features),
("cat_pipeline", cat_pipe, cat_features)],
remainder='drop',
n_jobs=-1)
from sklearn.neural_network import MLPClassifier
mlp_pipe = Pipeline([
("data_prep", data_prep_pipe),
('MLP', MLPClassifier(random_state=42))])
params = {'MLP__hidden_layer_sizes':[10,100, 500],
'MLP__alpha':[0.0001,0.001,0.01],
'MLP__activation':['identity','logistic','tanh','relu']}
mlp_gridsearch = GridSearchCV(mlp_pipe, param_grid = params, cv = 3, scoring='accuracy', n_jobs = -1)
print("Performing GS using Multi-Layered Perceptron")
print("Pipeline:", [name for name, _ in mlp_pipe.steps])
print("Parameters:")
print(params)
t0 = time()
mlp_gridsearch.fit(X_train, y_train)
print("Took %0.3fs to fit" % (time() - t0))
print("Best parameters set found on development set:")
print(mlp_gridsearch.best_params_)
print("Grid scores on development set:")
means = mlp_gridsearch.cv_results_['mean_test_score']
stds = mlp_gridsearch.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, mlp_gridsearch.cv_results_['params']): print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
scoring='accuracy'
#Best accuracy score and best parameter combination
print("Best %s score: %0.3f" %(scoring, mlp_gridsearch.best_score_))
print("Best parameters set:")
best_parameters = mlp_gridsearch.best_estimator_.get_params()
for param_name in sorted(params.keys()): print("\t%s: %r" % (param_name, best_parameters[param_name]))
#Sort the grid search results in decreasing avg order
sortedGridSearchResults = sorted(zip(mlp_gridsearch.cv_results_["params"], mlp_gridsearch.cv_results_["mean_test_score"]), key=lambda x: x[1], reverse=True)
print(f'Top 2 GridSearch results: ({scoring}, hyperparam Combo)\n {sortedGridSearchResults[0]}\n {sortedGridSearchResults[1]}\n\n\n')
y_pred_train = mlp_gridsearch.predict(X_train)
print(f"Training accuracy by best pipeline is: {accuracy_score(y_pred_train, y_train):.3f}")
y_pred_test = mlp_gridsearch.predict(X_test)
print(f"Testing accuracy by best pipeline is: {accuracy_score(y_pred_test, y_test):.3f}")
Performing GS using Multi-Layered Perceptron
Pipeline: ['data_prep', 'MLP']
Parameters:
{'MLP__hidden_layer_sizes': [10, 100, 500], 'MLP__alpha': [0.0001, 0.001, 0.01], 'MLP__activation': ['identity', 'logistic', 'tanh', 'relu']}
Took 1221.743s to fit
Best parameters set found on development set:
{'MLP__activation': 'identity', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 500}
Grid scores on development set:
0.921 (+/-0.000) for {'MLP__activation': 'identity', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 10}
0.921 (+/-0.003) for {'MLP__activation': 'identity', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 100}
0.921 (+/-0.002) for {'MLP__activation': 'identity', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 500}
0.921 (+/-0.000) for {'MLP__activation': 'identity', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 10}
0.921 (+/-0.004) for {'MLP__activation': 'identity', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 100}
0.921 (+/-0.002) for {'MLP__activation': 'identity', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 500}
0.921 (+/-0.001) for {'MLP__activation': 'identity', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 10}
0.922 (+/-0.004) for {'MLP__activation': 'identity', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 100}
0.922 (+/-0.002) for {'MLP__activation': 'identity', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 500}
0.921 (+/-0.005) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 10}
0.914 (+/-0.006) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 100}
0.902 (+/-0.008) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 500}
0.921 (+/-0.004) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 10}
0.916 (+/-0.005) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 100}
0.903 (+/-0.010) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 500}
0.922 (+/-0.003) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 10}
0.918 (+/-0.005) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 100}
0.914 (+/-0.001) for {'MLP__activation': 'logistic', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 500}
0.904 (+/-0.008) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 10}
0.909 (+/-0.005) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 100}
0.911 (+/-0.003) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 500}
0.904 (+/-0.009) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 10}
0.910 (+/-0.005) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 100}
0.911 (+/-0.005) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 500}
0.905 (+/-0.009) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 10}
0.910 (+/-0.006) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 100}
0.909 (+/-0.006) for {'MLP__activation': 'tanh', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 500}
0.902 (+/-0.013) for {'MLP__activation': 'relu', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 10}
0.898 (+/-0.007) for {'MLP__activation': 'relu', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 100}
0.908 (+/-0.008) for {'MLP__activation': 'relu', 'MLP__alpha': 0.0001, 'MLP__hidden_layer_sizes': 500}
0.901 (+/-0.009) for {'MLP__activation': 'relu', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 10}
0.899 (+/-0.009) for {'MLP__activation': 'relu', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 100}
0.908 (+/-0.008) for {'MLP__activation': 'relu', 'MLP__alpha': 0.001, 'MLP__hidden_layer_sizes': 500}
0.901 (+/-0.008) for {'MLP__activation': 'relu', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 10}
0.899 (+/-0.007) for {'MLP__activation': 'relu', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 100}
0.905 (+/-0.003) for {'MLP__activation': 'relu', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 500}
Best accuracy score: 0.922
Best parameters set:
MLP__activation: 'identity'
MLP__alpha: 0.01
MLP__hidden_layer_sizes: 500
Top 2 GridSearch results: (accuracy, hyperparam Combo)
({'MLP__activation': 'identity', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 500}, 0.9217031004429205)
({'MLP__activation': 'identity', 'MLP__alpha': 0.01, 'MLP__hidden_layer_sizes': 100}, 0.9215602228889841)
Training accuracy by best pipeline is: 0.924
Testing accuracy by best pipeline is: 0.910
#Visualizing result
y_test_pred = mlp_gridsearch.best_estimator_.predict(X_test)
print("Confusion matrix (test data)")
print(confusion_matrix(y_test_pred, y_test))
print("------------------")
print(f"Overall accuracy (test data): {np.round(accuracy_score(y_test, y_test_pred), 3)*100}%")
print("------------------")
print(f"AUROC (test data): {np.round(roc_auc_score(y_test, y_test_pred), 3)*100}%")
RocCurveDisplay.from_estimator(mlp_gridsearch.best_estimator_, X_test, y_test)
plt.show()
Confusion matrix (test data) [[2722 262] [ 7 9]] ------------------ Overall accuracy (test data): 91.0% ------------------ AUROC (test data): 51.5%
expLog.loc[len(expLog)] = ["MLP"] + list(np.round(
[accuracy_score(y_train, mlp_gridsearch.best_estimator_.predict(X_train)),
accuracy_score(y_test, y_test_pred),
roc_auc_score(y_train, mlp_gridsearch.best_estimator_.predict_proba(X_train)[:, 1]),
roc_auc_score(y_test, y_test_pred)], 3)*100) + [
"MLP"]
expLog
| model_name | Train Acc | Valid Acc | Train AUC | Valid AUC | Comment (optional) | |
|---|---|---|---|---|---|---|
| 0 | MLP | 92.4 | 91.0 | 77.7 | 51.5 | MLP |
expLog.to_csv("explog.csv", index=False, header = 1)
explog = pd.read_csv("explog.csv")
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
application_train = pd.read_csv('application_train.csv')
# load data
X = application_train.loc[:, application_train.columns != 'TARGET']
y = application_train['TARGET']
#Create full set of categorical features
cat_features = X.select_dtypes(include=['object']).columns.tolist()
#Numeric Features
num_features = X.select_dtypes(include=np.number).columns.tolist()
# Check if all columns selected
if X.shape[1] == len(num_features) + len(cat_features): print("All columns have been selected")
else: print("All columns have not been selected, re-evaluate selection criteria")
All columns have been selected
# Create class for feature selection in the form of df columns
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self, attribute_names):
self.attribute_names = attribute_names
def fit(self, X, y=None):
return self
def transform(self, X):
return X[self.attribute_names].values
cat_pipe = Pipeline([
('selector', DataFrameSelector(cat_features)),
('imputer', SimpleImputer(strategy='most_frequent', fill_value='missing')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))]) # ignore values from validation/test data that do NOT occur in training set
# Baseline pipeline for numerical features
num_pipe = Pipeline([
('selector', DataFrameSelector(num_features)),
('imputer', SimpleImputer(strategy='mean')),
('std_scaler', StandardScaler())])
#Used ColumnTransformer as opposed to FeatureUnion in Phase1
data_prep_pipe = ColumnTransformer(transformers= [
# (name, transformer, columns)
("num_pipeline", num_pipe, num_features),
("cat_pipeline", cat_pipe, cat_features)],
remainder='drop',
n_jobs=-1)
#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#Fit X_train, X_test through prep pipeline
transformed_train = data_prep_pipe.fit_transform(X_train)
transformed_test = data_prep_pipe.fit_transform(X_test)
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()
# convert numpy arrays to tensors
X_train_tensor = torch.from_numpy(transformed_train)
X_test_tensor = torch.from_numpy(transformed_test)
y_train_tensor = torch.from_numpy(y_train)
y_test_tensor = torch.from_numpy(y_test)
from sklearn.metrics import auc, roc_curve
# create TensorDataset in PyTorch
app_train = torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor)
app_test = torch.utils.data.TensorDataset(X_test_tensor, y_test_tensor)
# create dataloader
# DataLoader is implemented in PyTorch, which will return an iterator to iterate training data by batch.
batch_size = 20 # batch_size is the size of each batch in which data returned
trainloader = torch.utils.data.DataLoader(app_train, batch_size=batch_size, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(app_test, batch_size=transformed_test.shape[0], shuffle=False, num_workers=2)
D_in = transformed_test.shape[1]
D_hidden = 20
D_out = 2
# Use the nn package to define our model and loss function.
# use the sequential API makes things simple
model = torch.nn.Sequential(
torch.nn.Linear(D_in, D_hidden),
torch.nn.ReLU(),
torch.nn.Linear(in_features=D_hidden, out_features=D_out))
model.to(device)
# use Cross Entropy and SGD optimizer.
loss_fn = torch.nn.CrossEntropyLoss() #for classfication
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
#summary(model, (4, 20))
print('------------------------------')
print('Model:')
print(model)
print('------------------------------')
epochs = range(5)
'''
Training Process:
Load a batch of data.
Zero the grad.
Predict the batch of the data through net i.e forward pass.
Calculate the loss value by predict value and true value.
Backprop i.e get the gradient with respect to parameters
Update optimizer
'''
train_losses = dict()
train_accuracy = dict()
for epoch in epochs:
running_loss = list()
y_pred = list()
epoch_target = list()
for batch, data in enumerate(trainloader):
inputs, target = data[0].to(device), data[1].to(device)
# Clear gradient buffers because we don't want any gradient from previous epoch to carry forward, dont want to cummulate gradients
optimizer.zero_grad()
# do forward pass
output = model(inputs.float())
# compute loss and gradients
loss = loss_fn(output, target)
# get gradients w.r.t to parameters
loss.backward()
# perform gradient update
optimizer.step()
y_pred.extend(torch.argmax(output, dim=1).tolist())
epoch_target.extend(target.tolist())
running_loss.append(loss.item())
epoch_training_loss = np.mean(running_loss)
train_losses[epoch+1] = epoch_training_loss
print(f"Epoch {epoch+1}, Training loss: {np.round(epoch_training_loss, 3)}")
#accuracy
correct = (np.array(y_pred) == np.array(epoch_target))
accuracy = correct.sum()/ correct.size
train_accuracy[epoch+1] = accuracy
print(f"Epoch {epoch+1}, Training accuracy: {np.round(accuracy, 3)}")
fpr, tpr, thresholds = roc_curve(np.array(epoch_target), np.array(y_pred), pos_label=1)
acu_train =auc(fpr, tpr)
print('Finished Training!')
------------------------------ Model: Sequential( (0): Linear(in_features=245, out_features=20, bias=True) (1): ReLU() (2): Linear(in_features=20, out_features=2, bias=True) ) ------------------------------ Epoch 1, Training loss: 0.258 Epoch 1, Training accuracy: 0.918 Epoch 2, Training loss: 0.252 Epoch 2, Training accuracy: 0.919 Epoch 3, Training loss: 0.251 Epoch 3, Training accuracy: 0.919 Epoch 4, Training loss: 0.25 Epoch 4, Training accuracy: 0.919 Epoch 5, Training loss: 0.25 Epoch 5, Training accuracy: 0.919 Finished Training!
import torch.nn.functional as nnf
test_batch_losses = list()
test_y_pred = list()
test_target = list()
for batch, data in enumerate(testloader):
inputs, target = data[0].to(device), data[1].to(device)
# do forward pass
output = model(inputs.float())
# compute loss
loss = loss_fn(output, target)
test_batch_losses.append(loss.item())
test_y_pred.extend(torch.argmax(output, dim=1).tolist())
test_target.extend(target.tolist())
#accuracy
test_correct = (np.array(test_y_pred) == np.array(test_target))
test_accuracy = test_correct.sum()/ test_correct.size
print(f"Test accuracy: {np.round(test_accuracy, 3)}")
fpr, tpr, thresholds = roc_curve(np.array(epoch_target), np.array(y_pred), pos_label=1)
auc_test =auc(fpr, tpr)
Test accuracy: 0.92
explog.loc[len(explog)] = ["MLP PyTorch"] + list(np.round(
[accuracy, test_accuracy, acu_train, auc_test], 3)*100) + [
"MLP - one hidden layer"]
explog
| model_name | Train Acc | Valid Acc | Train AUC | Valid AUC | Comment (optional) | |
|---|---|---|---|---|---|---|
| 0 | MLP | 92.4 | 91.0 | 77.7 | 51.5 | MLP |
| 1 | MLP PyTorch | 91.9 | 92.0 | 50.0 | 50.0 | MLP - one hidden layer |
In phase 2 of our HCDR project, we have worked on feature engineering and selection, model selection and hyperparameter tuning. We have built several pipelines as clean data-flow framework for our cross-validation-based decision-making process on ML algorithms and tunable parameters. Not entirely surprisingly, but still quite troublesome, we have faced significant challenges introduced by the size of our data and resulting computational demands forced us to scale back and adjust on our plans we originally had for this phase. We have managed to achieve an traning accuracy of 92% and AUC score of 0.73 for Random forest which is the best model implemented so far. To further improve on this, we plan a variety of model and feature (selection) refinements in the next phase, also looking for potential remedies on data size related challenges.